class

NoamScheduler

extends_LRScheduler
NoamScheduler(optimizer: Optimizer, d_model: int, warmup_steps: int, last_epoch: int = -1, verbose: bool = False)

Noam learning rate schedule from the original Transformer paper.

The learning rate increases linearly during a warmup phase and then decays proportionally to the inverse square root of the step number:

$$\eta_t = d_{\text{model}}^{-0.5} \cdot \min\!\left(t^{-0.5},\; t \cdot w^{-1.5}\right)$$

where $d_{\text{model}}$ is the model dimensionality, $w$ is the number of warmup steps, and $t$ is the current step number.
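The formula can be sketched as a standalone function. This is an illustration only, not the library's implementation: the name `noam_lr` is hypothetical, and the sketch assumes a 1-based step index so that the inverse square root is always defined.

```python
def noam_lr(step: int, d_model: int, warmup_steps: int) -> float:
    """Noam learning rate for a 1-based step index (assumption: step >= 1)."""
    if step < 1:
        # t = 0 would require 0 ** -0.5, which is undefined.
        raise ValueError("step must be >= 1")
    warmup_term = step * warmup_steps ** -1.5  # grows linearly with the step
    decay_term = step ** -0.5                  # inverse square root decay
    # min() selects the warmup branch before step w and the decay branch after it.
    return d_model ** -0.5 * min(decay_term, warmup_term)
```

Under these assumptions, `noam_lr(2000, 512, 4000)` and `noam_lr(16000, 512, 4000)` both give half the value reached at step 4000: the first is halfway up the linear ramp, the second has decayed by a factor of $\sqrt{4}$.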

Parameters

optimizer : Optimizer
Wrapped optimizer. The lr in each param group is set to the Noam value directly (the base learning rate is not used as a multiplicative factor).
d_model : int
Dimensionality of the model (e.g. 512 for the base Transformer). Larger models have smaller peak learning rates.
warmup_steps : int
Number of warmup steps during which the LR increases linearly. A typical value is 4000.
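last_epoch : int = -1
Index of the last step already taken. Because this scheduler is stepped once per batch, the value counts training steps rather than epochs (default: -1).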
verbose : bool = False
Print the updated LR after each step if True (default: False).

Attributes

d_model : int
Model dimension used in the scaling formula.
warmup_steps : int
Warmup period length.

Notes

The Noam schedule is designed to be called once per training step (i.e. per batch), not once per epoch. The learning rate peaks at step $t^* = w$ and then decreases as $t^{-0.5}$.
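The location and height of the peak follow from the schedule formula by setting the two arguments of the minimum equal. The numeric line below uses $d_{\text{model}} = 512$ and $w = 4000$ purely as illustrative values:

```latex
\begin{align*}
t^{-0.5} = t \cdot w^{-1.5}
  &\iff t^{1.5} = w^{1.5}
  \iff t^* = w \\
\eta_{t^*} &= d_{\text{model}}^{-0.5} \cdot w^{-0.5}
  = \left(d_{\text{model}} \cdot w\right)^{-0.5} \\
\eta_{t^*} &= (512 \cdot 4000)^{-0.5}
  = 2048000^{-0.5} \approx 6.99 \times 10^{-4}
\end{align*}
```

The peak therefore shrinks as either the model dimension or the warmup length grows.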

Examples

>>> import lucid.optim as optim
>>> optimizer = optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98))
>>> scheduler = optim.NoamScheduler(optimizer, d_model=512, warmup_steps=4000)
>>> for step, batch in enumerate(dataloader):
...     train_step(batch)
...     optimizer.step()
...     scheduler.step()

Methods (2)

dunder

__init__

None
__init__(optimizer: Optimizer, d_model: int, warmup_steps: int, last_epoch: int = -1, verbose: bool = False)

Initialise the NoamScheduler. See the class docstring for parameter semantics.

fn

get_lr

list[float]
get_lr()

Compute the learning rate for each parameter group at the current step.

Returns

list[float]

One learning rate per param group, derived from the schedule formula documented in the class docstring.
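A minimal sketch of the per-group behaviour, assuming a stand-in class rather than the real scheduler: `ToyNoamScheduler`, its `step_count` attribute, and the plain list of dicts used for param groups are all hypothetical, and clamping the step to 1 is an assumption made here to avoid evaluating `0 ** -0.5`.

```python
class ToyNoamScheduler:
    """Hypothetical stand-in for illustration; not the library's implementation."""

    def __init__(self, param_groups, d_model, warmup_steps):
        self.param_groups = param_groups  # list of dicts, each with an "lr" key
        self.d_model = d_model
        self.warmup_steps = warmup_steps
        self.step_count = 0

    def get_lr(self):
        t = max(self.step_count, 1)  # assumption: clamp so the step is never 0
        value = self.d_model ** -0.5 * min(t ** -0.5, t * self.warmup_steps ** -1.5)
        # Every group receives the same value; each group's base lr is ignored.
        return [value for _ in self.param_groups]

    def step(self):
        self.step_count += 1
        for group, lr in zip(self.param_groups, self.get_lr()):
            group["lr"] = lr  # overwrite directly rather than scaling the old lr
```

With two groups that start at different base rates, a single `step()` leaves both at the same Noam value, matching the note on the `optimizer` parameter that the base learning rate is not used as a multiplicative factor.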