RAdam
Optimizer
RAdam(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float = 0.001, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0)
Rectified Adam optimizer with variance-adapted step size.
RAdam addresses the large variance in Adam's effective learning rate during the early training steps, when the moving averages are poorly initialised. It computes a rectification term that scales the adaptive step, so the update moves smoothly from SGD with momentum towards Adam as more gradients are observed.
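The maximum length of the approximated SMA (simple moving average) is:

$$\rho_\infty = \frac{2}{1 - \beta_2} - 1$$

At step $t$ the current SMA length estimate is:

$$\rho_t = \rho_\infty - \frac{2 t \beta_2^t}{1 - \beta_2^t}$$

When $\rho_t$ exceeds the tractability threshold ($\rho_t > 4$ in the original RAdam formulation), the variance is tractable and a rectified adaptive step is used:

$$r_t = \sqrt{\frac{(\rho_t - 4)(\rho_t - 2)\,\rho_\infty}{(\rho_\infty - 4)(\rho_\infty - 2)\,\rho_t}}, \qquad \theta_t = \theta_{t-1} - \mathrm{lr} \cdot r_t \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Here $\beta_2$ is the second entry of betas, and $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected first and second moment estimates.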
Otherwise the update falls back to SGD with bias-corrected momentum.
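The SMA quantities and the rectification term described above can be sketched in plain Python. This is an illustrative sketch, not Lucid's implementation: the function names are hypothetical, and the `threshold` default of 4 follows the original RAdam formulation (some implementations use 5; the value used by this class is not stated here).

```python
import math


def sma_max_length(beta2: float) -> float:
    """Maximum length of the approximated SMA: 2 / (1 - beta2) - 1."""
    return 2.0 / (1.0 - beta2) - 1.0


def sma_length(step: int, beta2: float) -> float:
    """SMA length estimate at a given step (steps are counted from 1)."""
    beta2_t = beta2 ** step
    return sma_max_length(beta2) - 2.0 * step * beta2_t / (1.0 - beta2_t)


def rectification(step: int, beta2: float, threshold: float = 4.0):
    """Return the rectification term, or None when the variance is not tractable.

    Assumption: threshold >= 4, so the square root below stays real.
    """
    rho_inf = sma_max_length(beta2)
    rho_t = sma_length(step, beta2)
    if rho_t <= threshold:
        return None  # caller falls back to SGD with bias-corrected momentum
    num = (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
    den = (rho_inf - 4.0) * (rho_inf - 2.0) * rho_t
    return math.sqrt(num / den)


if __name__ == "__main__":
    # Print which early steps are rectified for the default beta2.
    for step in range(1, 9):
        print(step, sma_length(step, 0.999), rectification(step, 0.999))
```

With `beta2 = 0.999` the SMA length estimate is close to the step count in early training, so under a threshold of 4 the first four steps return `None` and the rectified branch starts at step 5. The term then grows towards 1 as the estimate approaches its maximum.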
Parameters
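params : iterable of Parameter or iterable of dict
    Parameters to optimise, or dicts defining parameter groups.
lr : float = 0.001
    Learning rate (default: 1e-3).
betas : tuple of float = (0.9, 0.999)
    Coefficients for the moving averages of the gradient and its square (default: (0.9, 0.999)).
eps : float = 1e-08
    Term added to the denominator for numerical stability (default: 1e-8).
weight_decay : float = 0
    Weight decay coefficient (default: 0).
Attributes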
param_groups : list of dict
    Parameter groups, each with the keys "params", "lr", "beta1", "beta2", "eps", and "weight_decay".
defaults : dict
    Default hyperparameter values.
Notes
RAdam removes the need for a warmup schedule by automatically stabilising the adaptive learning rate in the early stages of training. It is a drop-in replacement for Adam that is less sensitive to the choice of learning rate.
Examples
>>> import lucid.optim as optim
>>> optimizer = optim.RAdam(model.parameters(), lr=1e-3)
>>> optimizer.zero_grad()
>>> loss.backward()
>>> optimizer.step()
Methods
__init__
__init__(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float = 0.001, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0) → None
Initialise the RAdam optimizer. See the class docstring for parameter semantics.
step
step(closure: _OptimizerClosure = None) → Tensor | None
Perform a single RAdam step.
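To make the two branches of a step concrete, here is a minimal scalar sketch of one RAdam update in plain Python. It is not Lucid's code: `radam_step`, the `state` dict layout, and the `threshold` argument are hypothetical names chosen for illustration; weight decay and the closure are omitted; and the placement of `eps` follows the common Adam convention, which may differ from this class.

```python
import math


def radam_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, threshold=4.0):
    """Apply one RAdam update to a scalar parameter and return the new value.

    `state` holds "step", "m" and "v" and is updated in place.
    Assumption: threshold >= 4 (4 in the original formulation, 5 elsewhere).
    """
    beta1, beta2 = betas
    state["step"] += 1
    t = state["step"]

    # Exponential moving averages of the gradient and its square.
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * grad * grad
    m_hat = state["m"] / (1.0 - beta1 ** t)  # bias-corrected momentum

    # SMA length estimate decides which branch is taken.
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)

    if rho_t > threshold:
        # Variance is tractable: rectified adaptive step.
        v_hat = state["v"] / (1.0 - beta2 ** t)
        r_t = math.sqrt(((rho_t - 4.0) * (rho_t - 2.0) * rho_inf)
                        / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t))
        return param - lr * r_t * m_hat / (math.sqrt(v_hat) + eps)

    # Otherwise: SGD with bias-corrected momentum.
    return param - lr * m_hat


if __name__ == "__main__":
    # Minimise f(x) = x**2, whose gradient is 2 * x.
    x, state = 5.0, {"step": 0, "m": 0.0, "v": 0.0}
    for _ in range(200):
        x = radam_step(x, 2.0 * x, state, lr=0.1)
    print(x)
```

Keeping the state in a plain dict mirrors the per-parameter state an optimizer normally tracks. A tensor version would apply the same arithmetic element-wise.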