class NoamScheduler extends _LRScheduler

NoamScheduler(optimizer: Optimizer, d_model: int, warmup_steps: int, last_epoch: int = -1, verbose: bool = False)


Noam learning rate schedule from the original Transformer paper.

The learning rate increases linearly during a warmup phase and then decays proportionally to the inverse square root of the step number:

$$\eta_t = d_{\text{model}}^{-0.5} \cdot \min\!\left(t^{-0.5},\; t \cdot w^{-1.5}\right)$$

where $t$ is the step number, $d_{\text{model}}$ is the model dimensionality, and $w$ is the number of warmup steps.
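As a rough illustration of the formula above, the following standalone sketch evaluates the schedule in plain Python. The function name `noam_lr` and the 1-based step convention are assumptions made for this example; it is not the library's implementation.

```python
def noam_lr(step: int, d_model: int, warmup_steps: int) -> float:
    """Evaluate the Noam formula at a 1-based training step (illustrative helper)."""
    if step < 1:
        # t ** -0.5 is undefined at t = 0, so this sketch rejects non-positive steps.
        raise ValueError("step must be >= 1")
    scale = d_model ** -0.5                # d_model^{-0.5}
    warmup = step * warmup_steps ** -1.5   # linear ramp, the smaller term while t < w
    decay = step ** -0.5                   # inverse square root, the smaller term once t > w
    return scale * min(decay, warmup)


# Print the learning rate at a few steps for the base Transformer settings.
for t in (1, 1000, 4000, 16000):
    print(t, noam_lr(t, d_model=512, warmup_steps=4000))
```

The `min` picks the linear term during warmup and the decay term afterwards, so no explicit branch on `step < warmup_steps` is needed.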

Parameters

optimizer: Optimizer

Wrapped optimizer. The lr in each param group is set to the Noam value directly (the base learning rate is not used as a multiplicative factor).

d_model: int

Dimensionality of the model (e.g. 512 for the base Transformer). The learning rate scales as $d_{\text{model}}^{-0.5}$, so larger models have smaller peak learning rates.

warmup_steps: int

Number of warmup steps during which the LR increases linearly. A typical value is 4000.

last_epoch: int = -1

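Index of the last completed scheduler step (default: -1). Because this scheduler is stepped once per batch, the value counts training steps rather than epochs.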

verbose: bool = False

Print the updated LR after each step if True (default: False).

Attributes

d_model: int

Model dimension used in the scaling formula.

warmup_steps: int

Warmup period length.

Notes

The Noam schedule is designed to be stepped once per training step (i.e. per batch), not once per epoch. The learning rate peaks at step $t^* = w$ and then decreases as $t^{-0.5}$.
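The location and height of the peak follow from the formula in the class description. The short derivation below shows where the two arguments of the `min` cross, and works out the peak value for the example settings $d_{\text{model}} = 512$, $w = 4000$ (the numeric result is rounded).

```latex
% The two branches are equal where the linear ramp meets the inverse-sqrt decay:
t \cdot w^{-1.5} = t^{-0.5}
\;\Longleftrightarrow\; t^{1.5} = w^{1.5}
\;\Longleftrightarrow\; t^* = w

% Substituting t = w gives the peak learning rate:
\eta_{\max} = d_{\text{model}}^{-0.5} \cdot w^{-0.5} = \left(d_{\text{model}} \cdot w\right)^{-0.5}

% Example: d_model = 512, w = 4000
\eta_{\max} = (512 \cdot 4000)^{-0.5} = (2\,048\,000)^{-0.5} \approx 6.99 \times 10^{-4}
```

For $t < w$ the linear term is the smaller one, and for $t > w$ the decay term is, so the `min` switches branches exactly at the peak.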

Examples

>>> import lucid.optim as optim
>>> optimizer = optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98))
>>> scheduler = optim.NoamScheduler(optimizer, d_model=512, warmup_steps=4000)
>>> for step, batch in enumerate(dataloader):
...     train_step(batch)
...     optimizer.step()
...     scheduler.step()

Methods (2)

__init__ → None

__init__(optimizer: Optimizer, d_model: int, warmup_steps: int, last_epoch: int = -1, verbose: bool = False)


Initialise the NoamScheduler. See the class docstring for parameter semantics.

get_lr → list[float]

get_lr()


Compute the learning rate for each parameter group at the current step.

Returns

list[float]

One learning rate per param group, derived from the schedule formula documented in the class docstring.
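To make the "one learning rate per param group" behaviour concrete, here is a minimal stand-in written without any library dependency. The class `TinyNoam`, its `step_count` attribute, the plain-dict param groups, and the clamp of the step to at least 1 are all assumptions for illustration; the actual `NoamScheduler` internals may differ.

```python
class TinyNoam:
    """Hypothetical stand-in that mimics a per-step Noam scheduler."""

    def __init__(self, param_groups, d_model, warmup_steps):
        self.param_groups = param_groups   # list of dicts, each with an "lr" key
        self.d_model = d_model
        self.warmup_steps = warmup_steps
        self.step_count = 0                # number of completed scheduler steps

    def get_lr(self):
        # Clamp to 1 so t ** -0.5 is defined before the first step (an assumption).
        t = max(self.step_count, 1)
        lr = self.d_model ** -0.5 * min(t ** -0.5, t * self.warmup_steps ** -1.5)
        # Every group receives the same value; the group's base lr is ignored.
        return [lr for _ in self.param_groups]

    def step(self):
        self.step_count += 1
        for group, lr in zip(self.param_groups, self.get_lr()):
            group["lr"] = lr


# Two groups with different base learning rates end up with identical values.
groups = [{"lr": 1.0}, {"lr": 0.1}]
sched = TinyNoam(groups, d_model=512, warmup_steps=4000)
sched.step()
print([g["lr"] for g in groups])
```

Overwriting `lr` instead of multiplying a base value matches the behaviour described for the `optimizer` parameter, which is why the `lr` passed to the optimizer has no effect on the schedule.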