clip_grad_norm_

clip_grad_norm_(parameters: Iterable[Parameter], max_norm: float, norm_type: float = 2.0, error_if_nonfinite: bool = False) → Tensor

Clip the global gradient norm of parameters in place.

Rescales every gradient so that the total $\ell_p$ norm — computed across all parameters jointly, as if they were one long concatenated vector — is at most max_norm. A staple of stable Transformer / RNN training: prevents the occasional huge gradient from derailing optimisation.

Parameters

parameters : iterable of Parameter

Parameters whose .grad should be clipped. Entries whose .grad is None are silently skipped.

max_norm : float

Maximum allowed norm of the combined gradient vector. The scaling factor never exceeds 1, so gradients whose combined norm is already below max_norm are left unchanged.

norm_type : float = 2.0

Order $p$ of the norm. Default 2.0 (Euclidean). Pass math.inf for the max-norm (element-wise absolute maximum across all gradients).

error_if_nonfinite : bool = False

If True, raise RuntimeError when the computed total norm is inf or nan instead of silently scaling by a non-finite coefficient.
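To make the roles of norm_type and error_if_nonfinite concrete, here is a minimal pure-Python sketch of how a combined norm could be computed. It is an illustration only, not the library's implementation: the helper name total_norm is hypothetical, gradients are modelled as plain lists of floats rather than tensors, and returning 0.0 when no gradients are present is an assumption.

```python
import math

def total_norm(grads, norm_type=2.0, error_if_nonfinite=False):
    """Combined p-norm over a list of flat gradient lists (hypothetical helper)."""
    # Entries that are None stand in for parameters without a gradient and are skipped.
    flat = [abs(x) for g in grads if g is not None for x in g]
    if not flat:
        return 0.0  # assumption: with nothing to measure, report a zero norm
    if norm_type == math.inf:
        # Max-norm: the largest absolute entry across all gradients.
        norm = max(flat)
    else:
        # General p-norm over every element, as if concatenated into one vector.
        norm = sum(x ** norm_type for x in flat) ** (1.0 / norm_type)
    if error_if_nonfinite and not math.isfinite(norm):
        raise RuntimeError("total gradient norm is non-finite")
    return norm

grads = [[3.0, 0.0], None, [4.0]]
print(total_norm(grads))            # Euclidean norm over all entries
print(total_norm(grads, math.inf))  # largest absolute entry
```

With error_if_nonfinite left at False, a non-finite norm is simply returned, which mirrors the "silently scaling" behaviour described above.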

Returns

Tensor

Scalar tensor holding the pre-clipping total norm. Useful for logging the gradient magnitude during training even when no actual clipping took place.

Notes

With combined norm $\|g\|_p = \left(\sum_i |g_i|^p\right)^{1/p}$ taken over every element of every gradient, the update is

$$g \;\mapsto\; g \cdot \min\!\left(1,\, \frac{\text{max\_norm}}{\|g\|_p + \epsilon}\right),$$

where $\epsilon = 10^{-6}$ guards against division by zero when all gradients vanish. Because every parameter is scaled by the same coefficient, the direction of the global update is preserved; only its magnitude is bounded.
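The update rule can be sketched in a few lines of pure Python for the Euclidean case. This is a hedged illustration, not the library's code: clip_by_global_norm is a hypothetical name, gradients are plain lists of floats, and the sketch returns new lists instead of modifying anything in place.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one shared coefficient (hypothetical helper).

    Returns the scaled gradients and the pre-clipping norm.
    """
    # Euclidean norm over every entry, as if the gradients were concatenated.
    norm = math.sqrt(sum(x * x for g in grads for x in g))
    # The coefficient is capped at 1, so small gradients are never scaled up.
    coef = min(1.0, max_norm / (norm + eps))
    clipped = [[x * coef for x in g] for g in grads]
    return clipped, norm

grads = [[3.0, 0.0], [4.0]]  # combined Euclidean norm is 5
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
new_norm = math.sqrt(sum(x * x for g in clipped for x in g))
print(norm, new_norm)        # new_norm lands just under max_norm
```

Because of the $\epsilon$ in the denominator, the clipped norm ends up marginally below max_norm rather than exactly equal to it, and the ratio between any two gradient entries is unchanged.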

Examples

>>> import lucid
>>> from lucid.nn.utils import clip_grad_norm_
>>> # model is assumed to be an already-built module; call after loss.backward()
>>> total_norm = clip_grad_norm_(model.parameters(), max_norm=1.0)