class Adam extends Optimizer

Adam(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float = 0.001, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False)

Adaptive Moment Estimation optimizer (Kingma & Ba, 2015).

Combines the strengths of two earlier adaptive methods: AdaGrad's per-parameter learning rates derived from the running history of gradients, and RMSProp's exponential moving average of squared gradients. Adam maintains two moment estimates and applies bias correction to compensate for their zero initialisation. Its default hyperparameters work well across a wide range of architectures with little tuning, which has made it a common default optimiser for deep learning training.
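The effect of zero initialisation can be illustrated with a short, library-independent sketch. The helper `moving_average` and its `beta` argument are hypothetical names chosen for this illustration and are not part of lucid; the sketch assumes a non-empty list of scalar gradients.

```python
def moving_average(grads, beta):
    """Return (raw, corrected) averages after consuming a non-empty list."""
    avg = 0.0                                   # zero initialisation
    for g in grads:
        avg = beta * avg + (1 - beta) * g       # exponential moving average
    corrected = avg / (1 - beta ** len(grads))  # undo the pull towards zero
    return avg, corrected

# Three identical gradients: the raw average is still far below 2.0,
# while the corrected value recovers it.
raw, corrected = moving_average([2.0, 2.0, 2.0], beta=0.9)
```

With a constant input the raw average only approaches the true value after many steps, whereas the corrected value matches it from the first step.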

Parameters

params: iterable of Parameter

Iterable of parameters to optimise, or dicts defining parameter groups with their own per-group hyperparameters.

lr: float = 0.001

Learning rate $\alpha$ (default: 1e-3).

betas: tuple of float = (0.9, 0.999)

Decay rates $(\beta_1, \beta_2)$ for the first- and second-moment running averages (default: (0.9, 0.999)).

eps: float = 1e-08

Term $\varepsilon$ added to the denominator for numerical stability (default: 1e-8).

weight_decay: float = 0

$L_2$ penalty coefficient (default: 0). Adam adds this penalty to the gradient before the moment update, so the decay is rescaled by the adaptive learning rate along with the rest of the gradient; prefer AdamW for decoupled weight decay.

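amsgrad: bool = False

Whether to use the AMSGrad variant (Reddi et al., 2018), which keeps the running maximum of the second-moment estimate so that the denominator of the update never shrinks. Reddi et al. show that plain Adam can fail to converge even on simple convex problems; AMSGrad restores the convergence guarantee (default: False).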
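The running maximum used by AMSGrad can be sketched in plain Python. The function `second_moment_trace` is a hypothetical helper written for this illustration, not lucid's engine-level code; it omits bias correction, and how the library combines the maximum with bias correction is not shown here.

```python
def second_moment_trace(grads, beta2=0.999, amsgrad=False):
    """Return the second-moment term used at each step (before the square root)."""
    v, v_max, trace = 0.0, 0.0, []
    for g in grads:
        v = beta2 * v + (1 - beta2) * g * g     # usual second-moment average
        v_max = max(v_max, v)                   # AMSGrad: keep the largest value seen
        trace.append(v_max if amsgrad else v)
    return trace

grads = [10.0] + [0.1] * 5                      # one large gradient, then small ones
plain = second_moment_trace(grads, beta2=0.9)
ams = second_moment_trace(grads, beta2=0.9, amsgrad=True)
```

In this sketch `plain` decays after the large gradient has passed, so later steps grow again, while `ams` stays at its peak and keeps later steps small.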

Notes

The update rule for parameter $\theta$ with gradient $g_t$ at step $t$ is:

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \\
\hat{m}_t &= m_t / (1 - \beta_1^t) \\
\hat{v}_t &= v_t / (1 - \beta_2^t) \\
\theta_t &= \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \varepsilon)
\end{aligned}
$$

where $m_t$ is the running first moment (mean of gradients), $v_t$ is the running uncentered second moment (mean of squared gradients), and $g_t^2$ is the element-wise square. Both moments start at zero, so the hat quantities apply bias correction; for a stationary gradient distribution this gives $\mathbb{E}[\hat{m}_t] = \mathbb{E}[g_t]$ even at small $t$. The effective per-parameter learning rate is $\alpha / (\sqrt{\hat{v}_t} + \varepsilon)$: small for parameters whose recent gradients are large in magnitude, large for parameters whose gradients are consistently small.
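The update rule can be written out as a minimal scalar sketch. This is an illustration in plain Python and not the engine-level implementation; `adam_step` is a hypothetical helper, and its `weight_decay` argument assumes the coupled behaviour described under Parameters (the penalty is added to the gradient before the moment update).

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999),
              eps=1e-8, weight_decay=0.0):
    """One Adam update for a single scalar parameter (illustrative only)."""
    beta1, beta2 = betas
    grad = grad + weight_decay * theta           # coupled L2 penalty, if any
    m = beta1 * m + (1 - beta1) * grad           # first moment m_t
    v = beta2 * v + (1 - beta2) * grad * grad    # second moment v_t
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                 # bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimise f(theta) = theta^2, whose gradient is 2 * theta.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):                         # t starts at 1, as in the formulas
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=1e-2)
```

In this sketch the first step has magnitude close to `lr` whatever the value of `weight_decay`, because at $t = 1$ the ratio $\hat{m}_t / \sqrt{\hat{v}_t}$ is approximately $\pm 1$. This is one way to see how a coupled penalty is absorbed by the adaptive scaling.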

Examples

>>> import lucid.optim as optim
>>> optimizer = optim.Adam(model.parameters(), lr=1e-3)
>>> for x, y in dataloader:
...     optimizer.zero_grad()
...     loss = loss_fn(model(x), y)
...     loss.backward()
...     optimizer.step()

Methods (2)

__init__ → None

__init__(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float = 0.001, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False)

Initialise the Adam optimiser. See the class docstring for parameter semantics.

step → Tensor or None

step(closure: _OptimizerClosure = None)

Perform a single Adam optimisation step.

Calls the engine-level Adam update for each parameter group, which applies the bias-corrected first- and second-moment update rule.

Parameters

closure: callable = None

A closure that re-evaluates the model and returns the loss. If provided, it is called before the parameter update and its return value is passed back to the caller.

Returns

Tensor or None

The loss returned by closure, or None if no closure was provided.
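The closure contract can be illustrated without the library. `ToyOptimizer` below is a hypothetical stand-in that reproduces only the calling convention described above (closure evaluated first, its return value handed back); it performs a plain gradient step rather than an Adam update and is not lucid code.

```python
class ToyOptimizer:
    """Hypothetical stand-in that mimics only the closure contract of step()."""

    def __init__(self, params, lr=0.1):
        self.params = params                    # list of dicts: {"value", "grad"}
        self.lr = lr

    def step(self, closure=None):
        # The closure, if given, runs before any parameter is changed.
        loss = closure() if closure is not None else None
        for p in self.params:
            p["value"] -= self.lr * p["grad"]   # plain gradient step, not Adam
        return loss

w = {"value": 3.0, "grad": 0.0}
opt = ToyOptimizer([w])

def closure():
    w["grad"] = 2 * w["value"]                  # gradient of loss = value ** 2
    return w["value"] ** 2

loss = opt.step(closure)                        # loss is computed at the old value
```

The returned loss corresponds to the parameter values before the update, which is the behaviour to expect from any optimiser that calls the closure first.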

Examples

>>> optimizer.zero_grad()
>>> loss = loss_fn(model(inputs), targets)
>>> loss.backward()
>>> optimizer.step()