class

Adam

extends Optimizer
Adam(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float = 0.001, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False)

Adaptive Moment Estimation optimizer (Kingma & Ba, 2015).

Combines the strengths of two earlier adaptive methods: AdaGrad's per-parameter learning rates, derived from the history of gradients, and RMSProp's exponential moving average of squared gradients. It maintains two moment estimates and applies bias correction to compensate for their zero initialisation. The default hyperparameters often work with little tuning across a wide range of architectures, which has made Adam a common default choice for deep learning training.

Parameters

params : iterable of Parameter
Iterable of parameters to optimise, or dicts defining parameter groups with their own per-group hyperparameters.
lr : float = 0.001
Learning rate $\alpha$ (default: 1e-3).
betas : tuple of float = (0.9, 0.999)
Decay rates $(\beta_1, \beta_2)$ for the first- and second-moment running averages (default: (0.9, 0.999)).
eps : float = 1e-08
Term $\varepsilon$ added to the denominator for numerical stability (default: 1e-8).
weight_decay : float = 0
$L_2$ penalty coefficient. Note that Adam folds this directly into the gradient before the moment update, which couples it with the adaptive learning rate; prefer AdamW for properly decoupled weight decay.
amsgrad : bool = False
Whether to use the AMSGrad variant (Reddi et al., 2018), which keeps the running maximum of the second-moment estimate and uses it in the denominator. It was proposed to address cases, including simple convex problems, where Adam can fail to converge (default: False).
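
The coupling described under weight_decay can be seen in a single first step. The following is a minimal sketch in plain Python, not this library's implementation: `first_adam_step` is a hypothetical helper, the numbers are made up for illustration, and the decoupled variant follows the AdamW idea of shrinking the weights directly rather than through the gradient.

```python
import math

def first_adam_step(grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the very first Adam update for a scalar gradient (t = 1)."""
    m = (1 - beta1) * grad            # m_1, starting from m_0 = 0
    v = (1 - beta2) * grad * grad     # v_1, starting from v_0 = 0
    m_hat = m / (1 - beta1)           # bias correction at t = 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

# Illustrative values: a weight of 1.0 whose data gradient happens to be zero.
theta, data_grad, wd, lr = 1.0, 0.0, 0.1, 1e-3

# Coupled (Adam with weight_decay): the penalty is added to the gradient,
# so it is rescaled by the adaptive denominator like any other gradient.
coupled_update = first_adam_step(data_grad + wd * theta, lr=lr)

# Decoupled (AdamW-style): the adaptive step sees only the data gradient,
# and the weight is shrunk separately by lr * wd * theta.
decoupled_update = first_adam_step(data_grad, lr=lr) + lr * wd * theta

print(coupled_update)    # close to lr: the size of wd has been normalised away
print(decoupled_update)  # lr * wd * theta
```

In this toy case the coupled penalty produces an update of roughly `lr` whatever `wd` is, whereas the decoupled update scales with `wd` as intended.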

Notes

The update rule for parameter $\theta$ with gradient $g_t$ at step $t$ is:

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \\
\hat{m}_t &= m_t / (1 - \beta_1^t) \\
\hat{v}_t &= v_t / (1 - \beta_2^t) \\
\theta_t &= \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \varepsilon)
\end{aligned}
$$

where $m_t$ is the running first moment (mean of gradients), $v_t$ is the running uncentered second moment (mean of squared gradients), and the hat-quantities apply bias correction so that $\mathbb{E}[\hat{m}_t] = \mathbb{E}[g_t]$ even at small $t$, assuming the gradient distribution is stationary. The effective per-parameter learning rate is $\alpha / (\sqrt{\hat{v}_t} + \varepsilon)$: smaller for parameters whose recent gradients have been large in magnitude, larger for parameters whose gradients have been small.
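
The update rule above can be sketched in a few lines of plain Python. This is an illustration only, not the engine-level implementation: `adam_step` and its `state` dict are hypothetical names, parameters are plain lists of floats, and `weight_decay` and `amsgrad` are left out.

```python
import math

def adam_step(theta, grad, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """Apply one Adam update to a list of floats, in place.

    `state` holds the step count `t` and the moment lists `m` and `v`;
    it is filled in on the first call.
    """
    beta1, beta2 = betas
    if not state:
        state.update(t=0, m=[0.0] * len(theta), v=[0.0] * len(theta))
    state["t"] += 1
    t = state["t"]
    for i, g in enumerate(grad):
        # Exponential moving averages of the gradient and squared gradient.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        # Bias correction for the zero initialisation of m and v.
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Minimise f(x, y) = x^2 + 100 * y^2, whose gradients differ in scale by 100x.
theta, state = [1.0, 1.0], {}
for _ in range(3):
    grad = [2 * theta[0], 200 * theta[1]]
    adam_step(theta, grad, state)
print(theta)  # both coordinates have moved by a similar amount
```

Although the second gradient is 100 times larger than the first, dividing by $\sqrt{\hat{v}_t}$ normalises the scale, so each coordinate moves by about `lr` per step in this example.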

Examples

>>> import lucid.optim as optim
>>> optimizer = optim.Adam(model.parameters(), lr=1e-3)
>>> for x, y in dataloader:
...     optimizer.zero_grad()
...     loss = loss_fn(model(x), y)
...     loss.backward()
...     optimizer.step()

Methods (2)

dunder

__init__

None
__init__(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float = 0.001, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False)

Initialise the Adam optimiser. See the class docstring for parameter semantics.

fn

step

Tensor or None
step(closure: _OptimizerClosure = None)

Perform a single Adam optimisation step.

Calls the engine-level Adam update for each parameter group, which applies the bias-corrected first- and second-moment update rule.

Parameters

closure : callable = None
A closure that re-evaluates the model and returns the loss. If provided, it is called before the parameter update and its return value is passed back to the caller.

Returns

Tensor or None

The loss returned by closure, or None if no closure was provided.
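
The closure contract described above (called before the update, return value handed back) can be illustrated with a toy stand-in. `ToyOptimizer` is hypothetical and is not part of this library; it performs a plain gradient step rather than an Adam update and only mirrors the documented closure handling.

```python
class ToyOptimizer:
    """Hypothetical stand-in that mimics only the closure handling of step()."""

    def __init__(self, params, lr=0.1):
        self.params = params  # list of {"value": float, "grad": float}
        self.lr = lr

    def step(self, closure=None):
        loss = None
        if closure is not None:
            loss = closure()  # re-evaluate the model before updating
        for p in self.params:
            p["value"] -= self.lr * p["grad"]  # plain gradient step, not Adam
        return loss

w = {"value": 3.0, "grad": 0.0}
opt = ToyOptimizer([w])

def closure():
    # Recompute the loss w^2 and its gradient 2w for the current weight.
    w["grad"] = 2 * w["value"]
    return w["value"] ** 2

loss = opt.step(closure)  # loss is evaluated at w = 3.0, before the update
print(loss, w["value"])
print(opt.step())         # no closure, so step() returns None
```

Because the closure runs before the update, the returned loss corresponds to the parameters as they were on entry to `step`, not to the updated ones.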

Examples

>>> optimizer.zero_grad()
>>> loss = loss_fn(model(inputs), targets)
>>> loss.backward()
>>> optimizer.step()