class RAdam

extends Optimizer

RAdam(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float = 0.001, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0)


Rectified Adam optimizer with variance-adapted step size.

RAdam addresses the large variance of Adam's adaptive learning rate during the early training steps, when the moving averages are built from only a few gradients. It computes a rectification term that moves the update smoothly from SGD with momentum toward Adam as training progresses.

The maximum length of the approximated simple moving average (SMA) is:

\rho_\infty = \frac{2}{1 - \beta_2} - 1

At step $t$ the current SMA length estimate is:

\rho_t = \rho_\infty - \frac{2 t \, \beta_2^t}{1 - \beta_2^t}
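The two SMA-length formulas above can be evaluated directly. The following is a minimal sketch in plain Python; the helper names `rho_inf` and `rho_t` are illustrative and are not part of the library.

```python
def rho_inf(beta2: float) -> float:
    """Maximum length of the approximated SMA."""
    return 2.0 / (1.0 - beta2) - 1.0


def rho_t(t: int, beta2: float) -> float:
    """SMA length estimate at step t (t is 1-based)."""
    beta2_t = beta2 ** t
    return rho_inf(beta2) - 2.0 * t * beta2_t / (1.0 - beta2_t)


# Print the estimate for the first few steps with the default beta2.
for step in range(1, 7):
    print(step, rho_t(step, 0.999))
```

With the default $\beta_2 = 0.999$, $\rho_\infty$ is about 1999 and $\rho_t$ stays slightly below $t$ for the first few steps, so the condition $\rho_t > 4$ first holds at $t = 5$.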

When $\rho_t > 4$ the variance is tractable and a rectified adaptive step is used:

\begin{aligned} r_t &= \sqrt{ \frac{(\rho_t - 4)(\rho_t - 2)\rho_\infty} {(\rho_\infty - 4)(\rho_\infty - 2)\rho_t}} \\ \theta_t &= \theta_{t-1} - \alpha \, r_t \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \end{aligned}

Otherwise ($\rho_t \le 4$) the update falls back to SGD with bias-corrected momentum: $\theta_t = \theta_{t-1} - \alpha \, \hat{m}_t$.
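To make the two branches concrete, here is a minimal sketch of one update on a plain list of floats. It follows the formulas above but is not the library's implementation: the function name `radam_step` and its argument layout are assumptions made for illustration, and weight decay is left out.

```python
import math


def radam_step(theta, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """Apply one RAdam-style update; returns the new (theta, m, v).

    Illustrative only: `t` is the 1-based step count, and `m`, `v` are the
    running first- and second-moment estimates (same length as `theta`).
    """
    beta1, beta2 = betas
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = rho_inf - 2.0 * t * beta2**t / (1.0 - beta2**t)

    # Update the exponential moving averages of the gradient and its square.
    m = [beta1 * mi + (1.0 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1.0 - beta2) * g * g for vi, g in zip(v, grad)]

    # Bias-corrected first moment, used by both branches.
    m_hat = [mi / (1.0 - beta1**t) for mi in m]

    if rho_t > 4.0:
        # Variance is tractable: rectified adaptive step.
        r_t = math.sqrt(
            (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
            / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t)
        )
        v_hat = [vi / (1.0 - beta2**t) for vi in v]
        theta = [
            p - lr * r_t * mh / (math.sqrt(vh) + eps)
            for p, mh, vh in zip(theta, m_hat, v_hat)
        ]
    else:
        # Fallback: SGD with bias-corrected momentum (no adaptive scaling).
        theta = [p - lr * mh for p, mh in zip(theta, m_hat)]
    return theta, m, v


# Usage: a few steps on f(x) = x^2, whose gradient is 2x.
theta, m, v = [1.0], [0.0], [0.0]
for t in range(1, 11):
    grad = [2.0 * x for x in theta]
    theta, m, v = radam_step(theta, grad, m, v, t, lr=0.05)
```

Keeping the moment buffers as explicit arguments makes the state visible. An optimizer class would normally store them per parameter instead.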

Parameters

params: iterable of Parameter or iterable of dict

Parameters to optimise, or a list of parameter-group dicts.

lr: float = 0.001

Learning rate $\alpha$ (default: 1e-3).

betas: tuple of float = (0.9, 0.999)

Coefficients $(\beta_1, \beta_2)$ for the first- and second-moment estimates (default: (0.9, 0.999)).

eps: float = 1e-08

Term $\epsilon$ for numerical stability (default: 1e-8).

weight_decay: float = 0

L2 regularisation coefficient (default: 0).

Attributes

param_groups: list of dict

Parameter groups with keys "params", "lr", "beta1", "beta2", "eps", and "weight_decay".
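As an illustration of the group layout described above, the sketch below fills in missing keys of each group from a defaults dict. This is a hypothetical helper (`build_param_groups`), not the library's code. How the library actually merges defaults, and whether `defaults` stores `betas` as a tuple or as separate `"beta1"`/`"beta2"` entries, is an assumption here. Strings stand in for `Parameter` objects.

```python
# Assumed layout of the default hyperparameters (keys follow the list above).
defaults = {
    "lr": 1e-3,
    "beta1": 0.9,
    "beta2": 0.999,
    "eps": 1e-8,
    "weight_decay": 0.0,
}


def build_param_groups(groups, defaults):
    """Return new group dicts with missing hyperparameters taken from `defaults`."""
    return [{**defaults, **group} for group in groups]


groups = build_param_groups(
    [
        {"params": ["w1", "w2"]},            # uses every default
        {"params": ["b1"], "lr": 1e-4},      # overrides the learning rate only
    ],
    defaults,
)

for group in groups:
    print(group)
```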

defaults: dict

Default hyperparameter values.

Notes

RAdam is designed to remove the need for a learning-rate warmup schedule by automatically stabilising the adaptive learning rate in the early stages of training. It takes the same hyperparameters as Adam, so it can be used as a drop-in replacement, and it is reported to be less sensitive to the choice of learning rate.
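The warmup-like behaviour can be seen by tabulating the rectification term over training steps. The sketch below uses the formulas from the class description; the function name `rectification` is illustrative only, and it returns `None` for steps where the SGD-with-momentum fallback would apply.

```python
import math


def rectification(t, beta2=0.999):
    """Return r_t, or None when rho_t <= 4 (fallback branch)."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = rho_inf - 2.0 * t * beta2**t / (1.0 - beta2**t)
    if rho_t <= 4.0:
        return None
    return math.sqrt(
        (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
        / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t)
    )


# r_t starts small once the rectified branch is active and grows toward 1.
for t in (1, 5, 10, 100, 1000, 10000):
    print(t, rectification(t))
```

Because $r_t$ multiplies the step size, small early values act like an implicit warmup that fades as $\rho_t$ approaches $\rho_\infty$.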

Examples

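>>> import lucid.optim as optim
>>> # `model` and `loss` are assumed to be defined by the surrounding training code.
>>> optimizer = optim.RAdam(model.parameters(), lr=1e-3)
>>> optimizer.zero_grad()   # clear gradients from the previous step
>>> loss.backward()         # populate parameter gradients
>>> optimizer.step()        # apply one RAdam update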

Methods (2)

__init__(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float = 0.001, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0) → None


Initialise the RAdam optimizer. See the class docstring for parameter semantics.

step(closure: _OptimizerClosure = None) → Tensor | None


Perform a single RAdam step.