class Adamax

extends Optimizer

Adamax(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float = 0.002, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0)

Adamax optimizer — a variant of Adam based on the infinity norm.

Adamax generalises Adam by using the $\ell_\infty$ norm instead of the $\ell_2$ norm for the second-moment estimate. The update rule replaces Adam's $v_t$ with $u_t$, the element-wise maximum of the current absolute gradient and the previous estimate $u_{t-1}$ decayed by $\beta_2$:

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\
u_t &= \max\!\left(\beta_2 \, u_{t-1},\; |g_t|\right) \\
\theta_t &= \theta_{t-1} - \frac{\eta}{1 - \beta_1^t} \cdot \frac{m_t}{u_t + \epsilon}
\end{aligned}
$$

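Assuming $u_0 = 0$, unrolling the recursion gives $u_t = \max_{0 \le k < t} \beta_2^k |g_{t-k}|$. Since $u_t \ge |g_t|$, the denominator never falls below the current gradient magnitude, so the effective step size is naturally bounded.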
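The update rule above can be sketched in plain Python. This is a minimal illustration on lists of floats, not lucid's implementation; the helper name `adamax_step` and the zero initialisation of `m` and `u` are assumptions made for the example.

```python
def adamax_step(theta, grad, m, u, t, lr=0.002, betas=(0.9, 0.999), eps=1e-8):
    """Apply one Adamax update to lists of floats; t is the 1-based step count."""
    beta1, beta2 = betas
    step_scale = lr / (1.0 - beta1 ** t)  # bias-corrected learning rate
    new_theta, new_m, new_u = [], [], []
    for th, g, m_prev, u_prev in zip(theta, grad, m, u):
        m_t = beta1 * m_prev + (1.0 - beta1) * g   # first-moment estimate
        u_t = max(beta2 * u_prev, abs(g))          # infinity-norm estimate
        new_theta.append(th - step_scale * m_t / (u_t + eps))
        new_m.append(m_t)
        new_u.append(u_t)
    return new_theta, new_m, new_u


# Minimise f(x) = x**2 starting from x = 1.0 (the gradient is 2x).
# Both state buffers start at zero, which is an assumption of this sketch.
theta, m, u = [1.0], [0.0], [0.0]
for t in range(1, 201):
    grad = [2.0 * x for x in theta]
    theta, m, u = adamax_step(theta, grad, m, u, t)
```

On the first step the update magnitude is approximately `lr`, because $m_1 / (1 - \beta_1) = g_1$ and $u_1 = |g_1|$.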

Parameters

params: iterable of Parameter or iterable of dict

Parameters to optimise, or a list of parameter-group dicts.

lr: float = 0.002

Learning rate $\eta$ (default: 2e-3).

betas: tuple of float = (0.9, 0.999)

Coefficients $(\beta_1, \beta_2)$ for the first-moment estimate and the $\ell_\infty$ norm decay (default: (0.9, 0.999)).

eps: float = 1e-08

Term $\epsilon$ added to the denominator for numerical stability (default: 1e-8).

weight_decay: float = 0

L2 regularisation coefficient (default: 0).
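As a rough sketch of what an L2 coefficient usually means in Adam-family optimizers, the penalty can be folded into the gradient before the moment updates. The source does not state how lucid applies `weight_decay`, so the coupled form below and the helper name `apply_l2_weight_decay` are assumptions.

```python
def apply_l2_weight_decay(grad, theta, weight_decay=0.0):
    """Return grad + weight_decay * theta, element-wise (coupled L2 penalty).

    Assumption: the penalty is added to the gradient before the Adamax
    moment updates, as is common for L2 regularisation.
    """
    return [g + weight_decay * th for g, th in zip(grad, theta)]


# With weight_decay = 0 (the default) the gradient is unchanged.
unchanged = apply_l2_weight_decay([0.5, -1.0], [2.0, 4.0])
# With a positive coefficient each gradient is pulled towards its parameter.
decayed = apply_l2_weight_decay([0.5, -1.0], [2.0, 4.0], weight_decay=0.1)
```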

Attributes

param_groups: list of dict

Parameter groups with keys "params", "lr", "beta1", "beta2", "eps", and "weight_decay".

defaults: dict

Default hyperparameter values.
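The relationship between `defaults` and `param_groups` can be illustrated with plain dicts. This is a hypothetical sketch of how per-group options might inherit missing keys from the defaults; `resolve_param_groups` is not a lucid function, and the string placeholders stand in for real parameters.

```python
# Default values taken from the constructor signature, using the group keys
# listed under param_groups.
DEFAULTS = {"lr": 0.002, "beta1": 0.9, "beta2": 0.999, "eps": 1e-8, "weight_decay": 0.0}


def resolve_param_groups(groups, defaults):
    """Fill every group with defaults, letting group-specific keys win."""
    resolved = []
    for group in groups:
        merged = dict(defaults)   # copy so the defaults stay untouched
        merged.update(group)      # per-group overrides take precedence
        resolved.append(merged)
    return resolved


# Two groups: the second overrides only the learning rate.
groups = resolve_param_groups(
    [{"params": ["w1"]}, {"params": ["w2"], "lr": 0.01}],
    DEFAULTS,
)
```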

Notes

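Adamax can be more stable than Adam on problems where gradients are sparse or have large outliers. When a large gradient arrives, $u_t$ jumps to its magnitude immediately, which caps that coordinate's step, whereas Adam's exponentially averaged $v_t$ absorbs only a fraction $(1 - \beta_2)$ of the squared outlier.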
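The behaviour on outliers can be probed numerically. The sketch below feeds a long run of small gradients followed by one large gradient to scalar versions of both update rules and reports the resulting step in units of the learning rate. The gradient values, the warm-up length, and the helper name `step_sizes_after_outlier` are arbitrary choices for illustration, and the Adam formula used is the standard bias-corrected one.

```python
import math


def step_sizes_after_outlier(n_warmup=5000, small=0.01, outlier=10.0,
                             beta1=0.9, beta2=0.999, eps=1e-8):
    """Return (adam_step, adamax_step) divided by lr, right after an outlier."""
    m = v = u = 0.0
    grads = [small] * n_warmup + [outlier]
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g        # first moment (both methods)
        v = beta2 * v + (1.0 - beta2) * g * g    # Adam second moment
        u = max(beta2 * u, abs(g))               # Adamax infinity-norm estimate
    m_hat = m / (1.0 - beta1 ** t)               # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)               # bias-corrected second moment
    adam_step = m_hat / (math.sqrt(v_hat) + eps)
    adamax_step_size = m_hat / (u + eps)
    return adam_step, adamax_step_size


adam_ratio, adamax_ratio = step_sizes_after_outlier()
```

With these settings the Adam step on the outlier exceeds the learning rate several times over, while the Adamax step stays at roughly a tenth of it. The same mechanism also keeps Adamax steps small for a while afterwards, since $u_t$ decays only by a factor $\beta_2$ per step.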

Examples

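>>> import lucid.optim as optim
>>> # `model` and `loss` are assumed to come from the surrounding training code.
>>> optimizer = optim.Adamax(model.parameters(), lr=2e-3)
>>> optimizer.zero_grad()  # clear gradients from the previous step
>>> loss.backward()        # populate gradients for the current loss
>>> optimizer.step()       # apply one Adamax update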

Methods (2)

__init__ → None

__init__(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float = 0.002, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0)

Initialise the Adamax optimizer. See the class docstring for parameter semantics.

step → Tensor | None

step(closure: _OptimizerClosure = None)

Perform a single Adamax step.