class RAdam

extends Optimizer
RAdam(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float = 0.001, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0)

Rectified Adam optimizer with variance-adapted step size.

RAdam addresses the large variance of Adam's effective learning rate during the first training steps, when the second-moment moving average is estimated from only a few gradients. It computes a rectification term that scales the adaptive step and moves smoothly from SGD with momentum to Adam.

The maximum length of the approximated SMA is:

$$\rho_\infty = \frac{2}{1 - \beta_2} - 1$$

At step $t$ the current SMA length estimate is:

$$\rho_t = \rho_\infty - \frac{2 t \, \beta_2^t}{1 - \beta_2^t}$$

When $\rho_t > 4$ the variance is tractable and a rectified adaptive step is used:

$$
\begin{aligned}
r_t &= \sqrt{\frac{(\rho_t - 4)(\rho_t - 2)\rho_\infty}{(\rho_\infty - 4)(\rho_\infty - 2)\rho_t}} \\
\theta_t &= \theta_{t-1} - \alpha \, r_t \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
$$

Otherwise the update falls back to SGD with bias-corrected momentum.
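The three formulas above can be evaluated directly. The following is a minimal sketch in plain Python, not the library's implementation; the function names `rho_inf`, `rho_t`, and `rectification` are hypothetical, and steps are assumed to be counted from `t = 1`.

```python
from __future__ import annotations

import math


def rho_inf(beta2: float) -> float:
    """Maximum length of the approximated SMA."""
    return 2.0 / (1.0 - beta2) - 1.0


def rho_t(t: int, beta2: float) -> float:
    """SMA length estimate at step t (assumes t >= 1)."""
    beta2_t = beta2 ** t
    return rho_inf(beta2) - 2.0 * t * beta2_t / (1.0 - beta2_t)


def rectification(t: int, beta2: float) -> float | None:
    """Return r_t when rho_t > 4, else None to signal the SGD-style fallback."""
    rho_max = rho_inf(beta2)
    rho = rho_t(t, beta2)
    if rho <= 4.0:
        return None
    numerator = (rho - 4.0) * (rho - 2.0) * rho_max
    denominator = (rho_max - 4.0) * (rho_max - 2.0) * rho
    return math.sqrt(numerator / denominator)
```

Under these formulas `rectification` returns `None` for the first few steps, then a small factor that grows towards 1 as $\rho_t$ approaches $\rho_\infty$.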

Parameters

params : iterable of Parameter or iterable of dict
Parameters to optimise, or a list of parameter-group dicts.
lr : float = 0.001
Learning rate $\alpha$ (default: 1e-3).
betas : tuple of float = (0.9, 0.999)
Coefficients $(\beta_1, \beta_2)$ for the first- and second-moment estimates (default: (0.9, 0.999)).
eps : float = 1e-08
Term $\epsilon$ for numerical stability (default: 1e-8).
weight_decay : float = 0
L2 regularisation coefficient (default: 0).

Attributes

param_groups : list of dict
Parameter groups with keys "params", "lr", "beta1", "beta2", "eps", and "weight_decay".
defaults : dict
Default hyperparameter values.

Notes

RAdam removes the need for a warmup schedule by automatically stabilising the adaptive learning rate in the early stages of training. It is a drop-in replacement for Adam that is less sensitive to the choice of learning rate.
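To see how long the built-in stabilisation phase lasts, the sketch below counts the steps for which $\rho_t \le 4$, using only the formulas for $\rho_\infty$ and $\rho_t$ given above. The helper `first_rectified_step` is hypothetical and not part of the library; steps are assumed to start at `t = 1`.

```python
from __future__ import annotations


def first_rectified_step(beta2: float, max_steps: int = 10_000) -> int | None:
    """Return the first step t with rho_t > 4, or None if none occurs within max_steps."""
    rho_max = 2.0 / (1.0 - beta2) - 1.0
    for t in range(1, max_steps + 1):
        beta2_t = beta2 ** t
        rho = rho_max - 2.0 * t * beta2_t / (1.0 - beta2_t)
        if rho > 4.0:
            return t
    return None


if __name__ == "__main__":
    for beta2 in (0.9, 0.99, 0.999):
        print(beta2, first_rectified_step(beta2))
```

Because $\rho_t$ never exceeds $\rho_\infty$, a small `beta2` (for example 0.5, where $\rho_\infty = 3$) never reaches the rectified branch under these formulas, so every step would use the fallback update.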

Examples

>>> import lucid.optim as optim
>>> optimizer = optim.RAdam(model.parameters(), lr=1e-3)
>>> optimizer.zero_grad()
>>> loss.backward()
>>> optimizer.step()

Methods (2)

__init__ → None

__init__(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float = 0.001, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0)

Initialise the RAdam optimizer. See the class docstring for parameter semantics.

step → Tensor | None

step(closure: _OptimizerClosure = None)

Perform a single RAdam step.
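As an illustration of what one step computes, here is a minimal sketch of the update from the class description applied to a single scalar parameter. It is not the library's implementation: `radam_scalar_step` is a hypothetical helper, weight decay is assumed to be added to the gradient as an L2 term, and the fallback branch is assumed to be $\theta_t = \theta_{t-1} - \alpha \, \hat{m}_t$.

```python
import math


def radam_scalar_step(theta, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999),
                      eps=1e-8, weight_decay=0.0):
    """Apply one RAdam-style update to a scalar; returns (theta, m, v). Assumes t >= 1."""
    beta1, beta2 = betas
    if weight_decay != 0.0:
        grad = grad + weight_decay * theta  # assumption: L2 term folded into the gradient

    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)  # bias-corrected first moment

    # SMA length estimates, as defined in the class description.
    rho_max = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** t
    rho = rho_max - 2.0 * t * beta2_t / (1.0 - beta2_t)

    if rho > 4.0:
        # Rectified adaptive step.
        v_hat = v / (1.0 - beta2_t)
        r = math.sqrt((rho - 4.0) * (rho - 2.0) * rho_max
                      / ((rho_max - 4.0) * (rho_max - 2.0) * rho))
        theta = theta - lr * r * m_hat / (math.sqrt(v_hat) + eps)
    else:
        # Fallback: SGD with bias-corrected momentum.
        theta = theta - lr * m_hat
    return theta, m, v
```

A caller would keep `m`, `v`, and the step counter `t` per parameter and pass them back in on every call, for example when minimising $f(\theta) = \theta^2$ with `grad = 2 * theta`.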