class

GradScaler

GradScaler(init_scale: float = 2.0 ** 16, growth_factor: float = 2.0, backoff_factor: float = 0.5, growth_interval: int = 2000, enabled: bool = True)

Dynamic loss-scaling helper for mixed-precision training.

Mixed-precision training keeps activations and weights in fp16 to halve memory traffic and exploit fp16-fast hardware paths. The cost is fp16's narrow dynamic range: gradient values smaller than about 6e-8 underflow to zero, and once enough of them do, the network stops learning. GradScaler works around this by multiplying the loss by a large constant $s$ before backpropagation:

$$\tilde{L} = s \cdot L, \qquad \frac{\partial \tilde{L}}{\partial \theta} = s \cdot \frac{\partial L}{\partial \theta}.$$

Because the chain rule carries the factor $s$ through every layer, the scaled gradients sit comfortably inside fp16's representable range. Before the optimizer step they are multiplied by $1/s$ in fp32, so the update is mathematically equivalent to ordinary training.
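A minimal sketch of the underflow and its fix, using only the Python standard library (the struct "e" format rounds to IEEE 754 binary16). The helper to_fp16 and the sample gradient value are illustrative assumptions and not part of GradScaler; plain Python floats stand in for the fp32 unscale.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

true_grad = 1e-8       # hypothetical gradient, below fp16's smallest subnormal
s = 2.0 ** 16          # the default init_scale

# Stored directly in fp16, the gradient is lost.
unscaled = to_fp16(true_grad)

# Scaled first, it survives the fp16 round trip and can be unscaled afterwards.
scaled = to_fp16(true_grad * s)
recovered = scaled / s   # unscale in higher precision

print(unscaled)    # 0.0
print(recovered)   # close to 1e-8, up to fp16 rounding error
```

The recovered value differs from the true gradient only by fp16 rounding error, which is a relative error of at most about 5e-4.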

The scale itself is adapted dynamically. After every step the unscaled gradients are checked for inf / NaN:

Overflow detected: the step is skipped and $s$ is multiplied by backoff_factor (0.5 by default).
No overflow for growth_interval consecutive steps: $s$ is multiplied by growth_factor (2.0 by default).

This produces a sawtooth schedule that tracks the largest scale the current gradient distribution can tolerate.
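The schedule above can be sketched as a small simulation. The function simulate_scale and its overflow_steps argument are hypothetical names for illustration, and the sketch assumes the counter of overflow-free steps resets both after a backoff and after a growth; the actual bookkeeping inside GradScaler may differ.

```python
def simulate_scale(overflow_steps, num_steps, init_scale=2.0 ** 16,
                   growth_factor=2.0, backoff_factor=0.5, growth_interval=2000):
    """Return the scale in effect at each step under the backoff/growth rule."""
    scale = init_scale
    clean_steps = 0      # consecutive overflow-free steps so far (assumed reset rule)
    history = []
    for step in range(num_steps):
        history.append(scale)
        if step in overflow_steps:
            # Overflow: shrink the scale and restart the count.
            scale *= backoff_factor
            clean_steps = 0
        else:
            clean_steps += 1
            if clean_steps == growth_interval:
                # Enough clean steps in a row: try a larger scale.
                scale *= growth_factor
                clean_steps = 0
    return history

# A tiny growth_interval makes the sawtooth visible within a few steps.
hist = simulate_scale(overflow_steps={4}, num_steps=8,
                      init_scale=8.0, growth_interval=2)
print(hist)  # [8.0, 8.0, 16.0, 16.0, 32.0, 16.0, 16.0, 32.0]
```

The scale climbs until step 4 overflows, halves, and then starts climbing again.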

Parameters

init_scale: float = 2**16

Initial loss scaling factor applied by scale.

growth_factor: float = 2.0

Multiplier applied to the scale after growth_interval consecutive non-overflowing steps. Must be > 1.0.

backoff_factor: float = 0.5

Multiplier applied when an inf / NaN gradient is detected. Must be in (0, 1).

growth_interval: int = 2000

Number of overflow-free steps required before the scale grows.

enabled: bool = True

When False, the scaler degenerates into a transparent pass-through: scale returns its input unchanged, step calls the optimizer directly, and update is a no-op.

Notes

The canonical training-loop pattern is scale-loss, then step, then update:

scale multiplies the loss by $s$ before backward() so the gradients land safely inside fp16 range.
step unscales the gradients, checks for inf / NaN, and either runs optimizer.step() or skips the update.
update adjusts $s$ according to the growth / backoff schedule for the next iteration.

Examples

>>> scaler = GradScaler()
>>> for x, y in dataloader:
...     with autocast():
...         out = model(x)
...         loss = loss_fn(out, y)
...     scaler.scale(loss).backward()
...     scaler.step(optimizer)
...     scaler.update()

Constructors

dunder

init

→None

__init__(init_scale: float = 2.0 ** 16, growth_factor: float = 2.0, backoff_factor: float = 0.5, growth_interval: int = 2000, enabled: bool = True)

Initialize the scaler state.

Parameters

init_scale: float = 2**16

Initial loss scaling factor applied by scale.

growth_factor: float = 2.0

Multiplier applied to the scale after growth_interval consecutive non-overflowing steps.

backoff_factor: float = 0.5

Multiplier applied when an inf / NaN gradient is detected.

growth_interval: int = 2000

Number of overflow-free steps required before the scale grows.

enabled: bool = True

When False the scaler is a transparent pass-through.

Instance methods

get_scale

→float

get_scale()

Return the current scale factor.

load_state_dict

→None

load_state_dict(state_dict: dict[str, float])

Restore the scaler state from a dict, such as one returned by state_dict.

scale

→Tensor | list[Tensor]

scale(outputs: Tensor | list[Tensor])

Multiply outputs by the current scale factor.

Parameters

outputs: Tensor | list[Tensor]

A Tensor or list of Tensors to scale.

Returns

Tensor | list[Tensor]

The scaled Tensor or list of Tensors, with the same structure as the input.

state_dict

→dict[str, float]

state_dict()

Return the scaler state as a serializable dict.

step

→Tensor | None

step(optimizer: Optimizer, args: object = (), kwargs: object = {})

Unscale the gradients and call optimizer.step() if no inf / NaN is detected.

If any gradient contains inf / NaN, the optimizer step is skipped and the parameters are left unchanged.

Parameters

optimizer: Optimizer

The optimizer to step.

Returns

Tensor | None

The return value of optimizer.step(), or None if step was skipped.
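The skip-or-step decision described above can be sketched in plain Python. The names sketch_step, apply_update, and sgd are hypothetical stand-ins, with lists of floats in place of tensors and a callback in place of a real Optimizer; this illustrates the control flow only and is not the library's implementation.

```python
import math

def sketch_step(grads, scale, apply_update):
    """Unscale grads, then call apply_update only if every value is finite.

    Returns (stepped, found_inf).
    """
    inv_scale = 1.0 / scale
    unscaled = [g * inv_scale for g in grads]
    found_inf = any(not math.isfinite(g) for g in unscaled)
    if found_inf:
        return False, True    # skip: parameters stay untouched
    apply_update(unscaled)
    return True, False

params = [1.0, 2.0]

def sgd(grads, lr=0.5):
    """Toy optimizer: plain gradient descent on the module-level params."""
    for i, g in enumerate(grads):
        params[i] -= lr * g

first = sketch_step([65536.0, 131072.0], scale=65536.0, apply_update=sgd)
after_first = list(params)
second = sketch_step([float("inf"), 1.0], scale=65536.0, apply_update=sgd)

print(first, after_first)   # (True, False) [0.5, 1.0]
print(second, params)       # (False, True) [0.5, 1.0]
```

The second call sees an inf gradient, so the parameters keep the values from the first step.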

update

→None

update(new_scale: float | None = None)

Update the scale factor.

If new_scale is provided, the scale is set to it directly. Otherwise, the scale is grown if no overflow was found for growth_interval consecutive steps, or reduced if the last step overflowed.

Parameters

new_scale: float | None = None

Explicit new scale value (optional).

In-place ops

unscale_

→None

unscale_(optimizer: Optimizer)

Divide gradients by the current scale in-place.

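Call this before gradient clipping, so that the clipping threshold is compared against the true gradient magnitudes rather than gradients inflated by $s$.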

Parameters

optimizer: Optimizer

The optimizer whose parameters' gradients will be unscaled.
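To see why unscaling has to come before clipping, the sketch below clips a small gradient in both orders. The helper clip_by_norm is a hypothetical pure-Python stand-in for a framework's norm-clipping utility, and the gradient values are made up for illustration.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale grads so that their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in grads]

scale = 2.0 ** 16
true_grads = [0.3, 0.4]                      # L2 norm 0.5, below the threshold
scaled_grads = [g * scale for g in true_grads]

# Wrong order: the threshold is compared against the inflated norm (32768),
# so the gradient is shrunk drastically before it is unscaled.
wrong = [g / scale for g in clip_by_norm(scaled_grads, max_norm=1.0)]

# Right order: unscale first, then clip. Nothing needs clipping here.
right = clip_by_norm([g / scale for g in scaled_grads], max_norm=1.0)

print(right)   # [0.3, 0.4]
print(wrong)   # both entries on the order of 1e-5
```

With the wrong order, every step is clipped to a norm of max_norm / scale, which in this example is tens of thousands of times smaller than the true gradient.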

Notes

The inverse-scale coefficient is always built in float32, even when the gradient is float16. At init_scale = 2**16 = 65536 the unscale factor is 1/65536 ≈ 1.526e-5, which is subnormal in float16 (the smallest normal float16 value is about 6.1e-5). Apple Silicon's Metal backend flushes float16 subnormals to zero, so a naive full(shape, inv_scale, F16, ...) coefficient becomes the zero tensor and every unscaled gradient collapses to 0. The model then stops learning, even though throughput still looks healthy.

Multiplying by a float32 coefficient instead promotes the gradient to float32 (via MLX's automatic promotion on mixed-dtype multiply). This keeps the unscale exact and also gives the optimizer float32 gradients for updating the float32 parameter slots, matching the reference framework's AMP path.
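The failure mode described above can be reproduced numerically with the standard library alone. The helpers to_fp16 and to_fp16_ftz are hypothetical: the second one models a backend that flushes float16 subnormals to zero and does not call Metal or MLX. Python floats are float64 rather than float32; the point is only that the coefficient is kept out of float16.

```python
import struct

FP16_MIN_NORMAL = 2.0 ** -14    # about 6.10e-5

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp16_ftz(x: float) -> float:
    """fp16 rounding with subnormals flushed to zero (models the backend)."""
    y = to_fp16(x)
    return 0.0 if abs(y) < FP16_MIN_NORMAL else y

inv_scale = 1.0 / 2.0 ** 16      # about 1.526e-5, subnormal in fp16
grad_fp16 = to_fp16(1024.0)      # a scaled gradient as stored in fp16

# Naive path: the coefficient itself is stored in fp16 and gets flushed.
coeff_naive = to_fp16_ftz(inv_scale)
naive = grad_fp16 * coeff_naive

# Safe path: keep the coefficient in higher precision and promote the gradient.
safe = grad_fp16 * inv_scale

print(coeff_naive, naive)   # 0.0 0.0
print(safe)                 # 0.015625
```

On hardware that preserves subnormals, the fp16 coefficient would survive as exactly 2**-16, which is why the bug only shows up on flush-to-zero backends.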

>>> scaler = GradScaler()
>>> for x, y in dataloader:
...     with autocast():
...         out = model(x)
...         loss = loss_fn(out, y)
...     scaler.scale(loss).backward()
...     scaler.step(optimizer)
...     scaler.update()