class AdamW

extends Optimizer
AdamW(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float = 0.001, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.01, amsgrad: bool = False)

Adam optimizer with decoupled weight decay regularisation.

AdamW fixes the weight-decay coupling present in standard Adam by applying the decay directly to the parameters rather than adding it to the gradient. The update rule is:

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t} \\
\hat{v}_t &= \frac{v_t}{1 - \beta_2^t} \\
\theta_t &= \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} - \alpha \lambda \theta_{t-1}
\end{aligned}
$$

The final term $-\alpha \lambda \theta_{t-1}$ is the decoupled weight decay; it is applied directly to the parameters, separately from the adaptive gradient step, rather than being mixed into $g_t$.
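The update rule can be traced by hand for a single scalar parameter. The sketch below is a plain-Python illustration of the equations above, not lucid's engine-level implementation; `adamw_step` is a hypothetical helper name, and `t` is assumed to be a 1-based step count.

```python
import math


def adamw_step(theta, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=1e-2):
    """One AdamW update for a scalar parameter (hypothetical helper, not lucid API)."""
    beta1, beta2 = betas
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate m_t
    v = beta2 * v + (1 - beta2) * grad * grad    # second-moment estimate v_t
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                 # bias-corrected second moment
    adaptive = lr * m_hat / (math.sqrt(v_hat) + eps)
    decay = lr * weight_decay * theta            # decoupled decay on theta_{t-1}
    return theta - adaptive - decay, m, v


# First step (t = 1) from zero-initialised moments.
theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adamw_step(theta, grad=0.5, m=m, v=v, t=1)
```

On the first step the bias correction makes the adaptive term close to `lr` times the sign of the gradient, so `theta` moves by roughly `lr` plus the separate decay of `lr * weight_decay * theta`.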

Parameters

params : iterable of Parameter or iterable of dict
Parameters to optimise, or an iterable of parameter-group dicts.
lr : float = 0.001
Learning rate $\alpha$ (default: 1e-3).
betas : tuple of float = (0.9, 0.999)
Coefficients $(\beta_1, \beta_2)$ for computing running averages of the gradient and its square (default: (0.9, 0.999)).
eps : float = 1e-08
Term $\epsilon$ added to the denominator for numerical stability (default: 1e-8).
weight_decay : float = 0.01
Decoupled weight decay coefficient $\lambda$ (default: 1e-2).
amsgrad : bool = False
Whether to use the AMSGrad variant, which maintains the running maximum of the second-moment (squared-gradient) estimate (default: False).
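The effect of amsgrad can be seen by comparing the denominator of the update with and without the running maximum. This is a sketch under assumptions: `second_moment_denominator` is a hypothetical helper, and whether lucid takes the maximum before or after bias correction is not stated here, so taking it before is an assumption of this example.

```python
import math


def second_moment_denominator(grads, beta2=0.999, eps=1e-8, amsgrad=False):
    """Return sqrt(v_hat) + eps at each step (hypothetical helper, not lucid API)."""
    v, v_max, denoms = 0.0, 0.0, []
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1 - beta2) * g * g
        if amsgrad:
            v_max = max(v_max, v)   # keep the largest estimate seen so far
            v_used = v_max
        else:
            v_used = v
        v_hat = v_used / (1 - beta2 ** t)        # bias correction
        denoms.append(math.sqrt(v_hat) + eps)
    return denoms


grads = [1.0, 0.0, 0.0, 0.0]   # one large gradient, then zeros
plain = second_moment_denominator(grads, amsgrad=False)
ams = second_moment_denominator(grads, amsgrad=True)
```

For the same gradient sequence, every entry of `ams` is at least as large as the matching entry of `plain`, so the AMSGrad step is never larger than the plain one in this sketch.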

Attributes

param_groups : list of dict
Parameter groups, each containing "params", "lr", "beta1", "beta2", "eps", "weight_decay", and "amsgrad".
defaults : dict
Default hyperparameter values.
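One common way for defaults and param_groups to relate, used by PyTorch-style optimizers, is that each group inherits any hyperparameter it does not set itself. The sketch below illustrates that convention with plain dicts; it is an assumption about the merging behaviour, not a statement of how lucid builds its groups, and the parameter names are placeholders.

```python
# Assumed defaults, restricted to two hyperparameters for brevity.
defaults = {"lr": 1e-3, "weight_decay": 1e-2}

# Two hypothetical groups: the second overrides weight_decay only.
groups = [
    {"params": ["encoder.weight"]},
    {"params": ["encoder.bias"], "weight_decay": 0.0},
]

# Fill every hyperparameter a group leaves unset from the defaults.
param_groups = [{**defaults, **group} for group in groups]
```

Under this convention the bias group keeps the default learning rate but is exempt from weight decay.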

Notes

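Decoupling keeps the weight decay term out of the adaptive scaling by $\sqrt{\hat{v}_t}$, so every parameter shrinks by the same relative amount $\alpha \lambda$ per step regardless of its gradient history. This makes weight_decay easier to tune separately from the gradient-driven update, although in the update rule above the decay is still scaled by the learning rate $\alpha$. AdamW is a common default optimizer for transformer-based models and often generalises better than Adam with L2 regularisation.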
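The difference from Adam with L2 regularisation shows up already on the first step. The sketch below compares the two for a scalar parameter with zero task gradient, so that any movement comes from the decay alone; `first_step` is a hypothetical helper written for this comparison, not part of lucid.

```python
import math


def first_step(theta, grad, lr, weight_decay, decoupled,
               betas=(0.9, 0.999), eps=1e-8):
    """First step (t = 1) from zero-initialised moments (hypothetical helper)."""
    beta1, beta2 = betas
    if not decoupled:
        grad = grad + weight_decay * theta       # L2: decay folded into the gradient
    m_hat = (1 - beta1) * grad / (1 - beta1)     # bias-corrected first moment
    v_hat = (1 - beta2) * grad * grad / (1 - beta2)
    theta_new = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        theta_new -= lr * weight_decay * theta   # AdamW: decay acts on theta directly
    return theta_new


# Zero task gradient: the parameter moves only because of weight decay.
coupled = first_step(1.0, grad=0.0, lr=0.1, weight_decay=0.01, decoupled=False)
decoupled = first_step(1.0, grad=0.0, lr=0.1, weight_decay=0.01, decoupled=True)
```

With coupled L2 the decay is normalised by $\sqrt{\hat{v}_t}$ like any other gradient, so the parameter moves by about `lr` (to roughly 0.9) almost regardless of `weight_decay`. With decoupled decay it moves by `lr * weight_decay * theta` (to 0.999).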

Examples

>>> import lucid.optim as optim
>>> optimizer = optim.AdamW(
...     model.parameters(), lr=1e-4, weight_decay=1e-2
... )
>>> optimizer.zero_grad()
>>> loss.backward()
>>> optimizer.step()

Methods (2)

__init__

__init__(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float = 0.001, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.01, amsgrad: bool = False) -> None

Initialise the AdamW optimizer. See the class docstring for parameter semantics.

step

step(closure: _OptimizerClosure = None) -> Tensor or None

Perform a single AdamW optimisation step.

Calls the engine-level AdamW update for each parameter group, which applies the adaptive gradient update followed by decoupled weight decay directly on the parameters.

Parameters

closure : callable = None
A closure that re-evaluates the model and returns the loss. If provided, it is called before the parameter update and its return value is passed back to the caller.

Returns

Tensor or None

The loss returned by closure, or None if no closure was provided.
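The closure contract described above can be illustrated without any framework. The sketch below uses a hypothetical `ToySGD` class with plain gradient descent in place of the AdamW update; it only shows the calling order and return value, and is not lucid's implementation.

```python
class ToySGD:
    """Stand-in optimizer showing the closure contract (hypothetical, not lucid API)."""

    def __init__(self, params, lr=0.1):
        self.params = params   # list of dicts: {"value": float, "grad": float}
        self.lr = lr

    def step(self, closure=None):
        loss = None
        if closure is not None:
            loss = closure()   # re-evaluate the model before updating
        for p in self.params:
            p["value"] -= self.lr * p["grad"]
        return loss


w = {"value": 3.0, "grad": 0.0}
opt = ToySGD([w])


def closure():
    # Loss (w - 1)^2 and its gradient 2 (w - 1), recomputed on each call.
    w["grad"] = 2.0 * (w["value"] - 1.0)
    return (w["value"] - 1.0) ** 2


loss = opt.step(closure)   # closure runs first, then the update
no_loss = opt.step()       # no closure: returns None and reuses the stored gradient
```

Here `loss` is the value computed by the closure before the update, and `no_loss` is None because the second call passes no closure.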

Examples

>>> optimizer.zero_grad()
>>> loss = model(inputs)  # assumes model(inputs) returns a scalar loss
>>> loss.backward()
>>> optimizer.step()