class

Adamax

extends Optimizer
Adamax(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float = 0.002, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0)

Adamax optimizer — a variant of Adam based on the infinity norm.

Adamax generalises Adam by using the $\ell_\infty$ norm instead of the $\ell_2$ norm for the second-moment estimate. The update rule replaces $v_t$ with the element-wise maximum of past absolute gradients scaled by $\beta_2$:

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\
u_t &= \max\!\left(\beta_2 \, u_{t-1},\; |g_t|\right) \\
\theta_t &= \theta_{t-1} - \frac{\eta}{(1 - \beta_1^t)} \cdot \frac{m_t}{u_t + \epsilon}
\end{aligned}
$$

Because $u_t$ is bounded by $\max_k \beta_2^k |g_{t-k}|$, the effective step size is naturally bounded.
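The update rule above can be sketched in plain NumPy. This is a minimal illustration, not Lucid's implementation: the function name `adamax_step`, the zero initialisation of `m` and `u`, and the toy quadratic objective are all assumptions made for the example.

```python
import numpy as np


def adamax_step(theta, grad, m, u, t, lr=2e-3, betas=(0.9, 0.999), eps=1e-8):
    """One Adamax update following the three equations above (t starts at 1).

    Hypothetical helper for illustration only; not part of lucid.optim.
    """
    beta1, beta2 = betas
    m = beta1 * m + (1.0 - beta1) * grad       # first-moment estimate m_t
    u = np.maximum(beta2 * u, np.abs(grad))    # infinity-norm estimate u_t
    step_size = lr / (1.0 - beta1 ** t)        # learning rate with bias correction
    theta = theta - step_size * m / (u + eps)  # parameter update
    return theta, m, u


# Toy problem: minimise f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
start = np.array([1.0, -2.0, 3.0])
theta = start.copy()
m = np.zeros_like(theta)  # assumed initial state m_0 = 0
u = np.zeros_like(theta)  # assumed initial state u_0 = 0
max_move = 0.0
for t in range(1, 201):
    grad = theta.copy()
    new_theta, m, u = adamax_step(theta, grad, m, u, t)
    # Track the largest per-coordinate change seen in any single step.
    max_move = max(max_move, float(np.max(np.abs(new_theta - theta))))
    theta = new_theta
```

In this run the per-coordinate change in any single step stays on the order of `lr`, regardless of the gradient's scale, which is the bounded step size the paragraph above describes.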

Parameters

params: iterable of Parameter or iterable of dict
Parameters to optimise, or a list of parameter-group dicts.
lr: float = 0.002
Learning rate $\eta$ (default: 2e-3).
betas: tuple of float = (0.9, 0.999)
Coefficients $(\beta_1, \beta_2)$ for the first-moment estimate and the $\ell_\infty$ norm decay (default: (0.9, 0.999)).
eps: float = 1e-08
Term $\epsilon$ added to the denominator for numerical stability (default: 1e-8).
weight_decay: float = 0
L2 regularisation coefficient (default: 0).
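As a rough sketch of what an L2 regularisation coefficient usually means in Adam-family optimizers: the term `weight_decay * theta` is added to the gradient before the moment updates. Whether Lucid applies it at exactly this point is an assumption, and `apply_l2_weight_decay` is a hypothetical helper, not a Lucid function.

```python
import numpy as np


def apply_l2_weight_decay(grad, theta, weight_decay=0.0):
    """Add the gradient of 0.5 * weight_decay * ||theta||^2 to grad.

    Hypothetical helper; assumes coupled (L2-style) weight decay.
    """
    if weight_decay == 0.0:
        return grad  # default: no regularisation term
    return grad + weight_decay * theta


theta = np.array([1.0, -2.0])
grad = np.array([0.5, 0.5])
decayed = apply_l2_weight_decay(grad, theta, weight_decay=0.1)
```

With `weight_decay=0` (the default) the gradient passes through unchanged, so the update reduces to the plain Adamax rule.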

Attributes

param_groups: list of dict
Parameter groups with keys "params", "lr", "beta1", "beta2", "eps", and "weight_decay".
defaults: dict
Default hyperparameter values.
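One plausible way to picture how `defaults` and `param_groups` relate: each group dict keeps its own overrides and inherits every other key from the defaults. The sketch below is an assumption about that relationship, not Lucid's code; `make_param_groups` and the string placeholders standing in for parameters are hypothetical.

```python
# Default values taken from the constructor signature; key names follow the
# "param_groups" description above.
ADAMAX_DEFAULTS = {
    "lr": 2e-3,
    "beta1": 0.9,
    "beta2": 0.999,
    "eps": 1e-8,
    "weight_decay": 0.0,
}


def make_param_groups(params, defaults):
    """Normalise `params` into a list of fully populated group dicts.

    Hypothetical helper: accepts either a flat iterable of parameters or an
    iterable of dicts that each contain a "params" key.
    """
    params = list(params)
    if params and isinstance(params[0], dict):
        groups = params
    else:
        groups = [{"params": params}]
    # Group-specific values win over the defaults.
    return [{**defaults, **group} for group in groups]


# Strings stand in for real Parameter objects in this illustration.
groups = make_param_groups(
    [{"params": ["w1", "w2"], "lr": 1e-2}, {"params": ["b1"]}],
    ADAMAX_DEFAULTS,
)
```

Here the first group overrides `lr` while the second inherits every hyperparameter from the defaults.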

Notes

Adamax can be more stable than Adam on problems where gradients are sparse or have large outliers, because the infinity norm is less sensitive to large individual gradient magnitudes than the L2 norm.

Examples

>>> import lucid.optim as optim
>>> # `model` and `loss` are assumed to be defined earlier.
>>> optimizer = optim.Adamax(model.parameters(), lr=2e-3)
>>> optimizer.zero_grad()
>>> loss.backward()
>>> optimizer.step()

Methods (2)

dunder

__init__

__init__(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float = 0.002, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0) -> None

Initialise the Adamax optimizer. See the class docstring for parameter semantics.

fn

step

step(closure: _OptimizerClosure = None) -> Tensor | None

Perform a single Adamax step.