class

AdamW

extends Optimizer

AdamW(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float = 0.001, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.01, amsgrad: bool = False)

Adam optimizer with decoupled weight decay regularisation.

AdamW fixes the weight-decay coupling present in standard Adam by applying the decay directly to the parameters rather than adding it to the gradient. The update rule is:

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t} \\
\hat{v}_t &= \frac{v_t}{1 - \beta_2^t} \\
\theta_t &= \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} - \alpha \lambda \theta_{t-1}
\end{aligned}
$$

The final term $-\alpha \lambda \theta_{t-1}$ is the decoupled weight decay; it is applied after the adaptive gradient step, not mixed into $g_t$.
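The update rule can be traced by hand with a minimal pure-Python sketch for one scalar parameter. This is only an illustration of the equations above, not the library's engine-level implementation; the function name `adamw_step` and its argument layout are assumptions made for this example.

```python
import math


def adamw_step(theta, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=1e-2):
    """One AdamW update for a single scalar parameter (illustrative only)."""
    beta1, beta2 = betas
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialised averages (t starts at 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Adaptive step and decoupled decay, both taken relative to theta_{t-1}.
    adaptive = lr * m_hat / (math.sqrt(v_hat) + eps)
    decay = lr * weight_decay * theta
    return theta - adaptive - decay, m, v


# First step (t = 1) from zero-initialised moments.
theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adamw_step(theta, grad=0.5, m=m, v=v, t=1)
print(theta, m, v)
```

On the first step the bias-corrected ratio has magnitude close to 1, so the adaptive term moves the parameter by roughly `lr`, while the decay term contributes `lr * weight_decay * theta`.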

Parameters

params: iterable of Parameter or iterable of dict

Parameters to optimise, or a list of parameter-group dicts.
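A parameter-group list might look like the sketch below. It assumes the common convention that each dict carries a `"params"` entry plus optional per-group overrides, with missing keys filled from the constructor defaults; whether this library resolves groups exactly this way is not confirmed here. The strings stand in for `Parameter` objects and `fill_defaults` is a hypothetical helper written only for this illustration.

```python
# Hypothetical stand-ins for Parameter objects; real code would pass tensors.
weights = ["layer1.weight", "layer2.weight"]
biases = ["layer1.bias", "layer2.bias"]

# One dict per group; keys other than "params" override the defaults.
param_groups = [
    {"params": weights},                      # uses the default weight_decay
    {"params": biases, "weight_decay": 0.0},  # no decay on biases
]

# Values that would otherwise come from the constructor arguments.
defaults = {"lr": 1e-4, "weight_decay": 1e-2}


def fill_defaults(groups, defaults):
    """Hypothetical helper: copy each group and fill in missing defaults."""
    return [{**defaults, **group} for group in groups]


resolved = fill_defaults(param_groups, defaults)
for group in resolved:
    print(group["params"], group["lr"], group["weight_decay"])
```

Splitting parameters this way is a common pattern for excluding biases or normalisation weights from decay while sharing one learning rate.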

lr: float = 0.001

Learning rate $\alpha$ (default: 1e-3).

betas: tuple of float = (0.9, 0.999)

Coefficients $(\beta_1, \beta_2)$ for computing running averages of the gradient and its square (default: (0.9, 0.999)).

eps: float = 1e-08

Term $\epsilon$ added to the denominator for numerical stability (default: 1e-8).

weight_decay: float = 0.01

Decoupled weight decay coefficient $\lambda$ (default: 1e-2).

amsgrad: bool = False

Whether to use the AMSGrad variant, which maintains a running maximum of the squared-gradient average and uses that maximum in the denominator (default: False).
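The effect of `amsgrad` on the denominator can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: the helper name `denominators` is hypothetical, and taking the maximum over the bias-corrected estimate (rather than the raw average) is an assumption, since implementations differ on that detail.

```python
import math


def denominators(grads, beta2=0.999, eps=1e-8):
    """Return per-step denominators without and with the AMSGrad maximum."""
    v, v_max = 0.0, 0.0
    plain, ams = [], []
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1 - beta2) * g * g
        v_hat = v / (1 - beta2 ** t)
        # AMSGrad keeps the largest estimate seen so far (assumed here to be
        # the bias-corrected one), so its denominator never shrinks.
        v_max = max(v_max, v_hat)
        plain.append(math.sqrt(v_hat) + eps)
        ams.append(math.sqrt(v_max) + eps)
    return plain, ams


# One large gradient followed by small ones.
plain, ams = denominators([1.0, 0.1, 0.1, 0.1])
for p, a in zip(plain, ams):
    print(f"plain={p:.4f}  amsgrad={a:.4f}")
```

After the large first gradient, the plain denominator decays as smaller gradients arrive, while the AMSGrad denominator stays at its peak, which keeps later step sizes from growing again.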

Attributes

param_groups: list of dict

Parameter groups, each containing "params", "lr", "beta1", "beta2", "eps", "weight_decay", and "amsgrad".

defaults: dict

Default hyperparameter values.

Notes

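Decoupled weight decay keeps the regulariser out of the adaptive rescaling: every parameter shrinks by the same factor $1 - \alpha \lambda$ per step, regardless of its gradient history. The decay strength therefore does not interact with the second-moment estimate, which simplifies hyperparameter tuning, although in the update rule above it is still scaled by $\alpha$. AdamW is a common default optimizer for transformer-based models and often generalises better than Adam with L2 regularisation.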
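The difference from Adam with L2 regularisation can be seen on a toy first step with a zero loss gradient, which isolates the regulariser. The sketch below is a hand-written illustration of the two update styles, not code from this library; `first_step` and its `decoupled` flag are hypothetical names.

```python
import math


def first_step(theta, grad, lr, weight_decay, decoupled,
               betas=(0.9, 0.999), eps=1e-8):
    """First update (t = 1) from zero-initialised moments, scalar parameter."""
    beta1, beta2 = betas
    if not decoupled:
        # Adam with L2: the penalty gradient is folded into g_t.
        grad = grad + weight_decay * theta
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1)
    v_hat = v / (1 - beta2)
    new_theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        # AdamW: decay acts on the parameter, outside the adaptive rescaling.
        new_theta -= lr * weight_decay * theta
    return new_theta


theta0, lr = 2.0, 1e-3
shifts = {}
for wd in (1e-2, 1e-1):
    l2_shift = theta0 - first_step(theta0, 0.0, lr, wd, decoupled=False)
    adamw_shift = theta0 - first_step(theta0, 0.0, lr, wd, decoupled=True)
    shifts[wd] = (l2_shift, adamw_shift)
    print(f"wd={wd}: Adam+L2 shift={l2_shift:.6f}, AdamW shift={adamw_shift:.6f}")
```

In this toy case Adam with L2 moves the parameter by roughly `lr` for both decay values, because dividing by $\sqrt{\hat{v}_t}$ cancels the magnitude of the penalty gradient, whereas the AdamW shift equals `lr * weight_decay * theta` and scales linearly with `weight_decay`.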

Examples

>>> import lucid.optim as optim
>>> optimizer = optim.AdamW(
...     model.parameters(), lr=1e-4, weight_decay=1e-2
... )
>>> optimizer.zero_grad()
>>> loss.backward()
>>> optimizer.step()

Methods (2)

__init__

→ None

__init__(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float = 0.001, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.01, amsgrad: bool = False)

Initialise the AdamW optimizer. See the class docstring for parameter semantics.

step

→ Tensor or None

step(closure: _OptimizerClosure = None)

Perform a single AdamW optimisation step.

Calls the engine-level AdamW update for each parameter group, which applies the adaptive gradient update followed by decoupled weight decay directly on the parameters.

Parameters

closure: callable = None

A closure that re-evaluates the model and returns the loss. If provided, it is called before the parameter update and its return value is passed back to the caller.

Returns

Tensor or None

The loss returned by closure, or None if no closure was provided.

Examples

>>> optimizer.zero_grad()
>>> loss = model(inputs)
>>> loss.backward()
>>> optimizer.step()
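The closure contract described under Parameters and Returns can be sketched with a stand-in class. `ToyOptimizer` is hypothetical and mirrors only the documented call order and return value; it performs no real parameter update and is not this library's implementation.

```python
class ToyOptimizer:
    """Hypothetical stand-in that mirrors only the documented closure contract."""

    def __init__(self):
        self.events = []

    def step(self, closure=None):
        loss = None
        if closure is not None:
            # The closure runs first, so the loss is evaluated before the update.
            loss = closure()
        self.events.append("update")  # placeholder for the parameter update
        return loss


opt = ToyOptimizer()


def closure():
    # A real closure would zero the gradients, run the forward pass,
    # call backward on the loss, and return the loss.
    opt.events.append("closure")
    return 0.25  # placeholder loss value


returned = opt.step(closure)
no_closure = opt.step()
print(opt.events, returned, no_closure)
```

With a closure, `step` hands back whatever the closure returned; without one, it returns `None`.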