class NoamScheduler extends _LRScheduler

NoamScheduler(optimizer: Optimizer, d_model: int, warmup_steps: int, last_epoch: int = -1, verbose: bool = False)

Noam learning rate schedule from the original Transformer paper.
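The learning rate increases linearly during a warmup phase and then decays proportionally to the inverse square root of the step number:

$$\text{lr} = d_{\text{model}}^{-0.5} \cdot \min\left(\text{step}^{-0.5},\ \text{step} \cdot \text{warmup\_steps}^{-1.5}\right)$$

where $d_{\text{model}}$ is the model dimensionality and $\text{warmup\_steps}$ is the number of warmup steps.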
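The schedule can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the standard formula from the Transformer paper, `d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)`. The helper name `noam_lr` and the clamp of `step` to at least 1 are assumptions made for this example, not part of the library.

```python
def noam_lr(step: int, d_model: int, warmup_steps: int) -> float:
    """Noam learning rate at a given training step (illustrative helper)."""
    # Assumption: clamp to 1 so that step 0 does not divide by zero.
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# Linear warmup: the rate at step 2000 is half the rate at step 4000.
warmup_half = noam_lr(2000, d_model=512, warmup_steps=4000)
peak = noam_lr(4000, d_model=512, warmup_steps=4000)
# Inverse-square-root decay: quadrupling the step halves the rate.
late = noam_lr(16000, d_model=512, warmup_steps=4000)
```

During warmup the second argument of `min` is the smaller one, which gives the linear ramp. After `warmup_steps` the first argument takes over and produces the inverse-square-root decay.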
Parameters
optimizer (Optimizer): Wrapped optimizer. The lr in each param group is set to the Noam value directly (the base learning rate is not used as a multiplicative factor).
d_model (int): Dimensionality of the model (e.g. 512 for the base Transformer). Larger models have smaller peak learning rates.
warmup_steps (int): Number of warmup steps during which the LR increases linearly. A typical value is 4000.
last_epoch (int, default -1): The index of the last epoch.
verbose (bool, default False): Print the updated LR after each step if True.

Attributes
d_model (int): Model dimension used in the scaling formula.
warmup_steps (int): Warmup period length.
Notes
The Noam schedule is designed to be called once per training step (i.e. per batch), not once per epoch. The learning rate peaks at step $\text{warmup\_steps}$ and then decreases as $\text{step}^{-0.5}$.
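To make the peak concrete, here is a worked calculation, assuming the standard Noam formula $\text{lr} = d_{\text{model}}^{-0.5} \cdot \min(\text{step}^{-0.5},\ \text{step} \cdot \text{warmup\_steps}^{-1.5})$. At $\text{step} = \text{warmup\_steps}$ the two arguments of the minimum coincide, so for the typical values $d_{\text{model}} = 512$ and $\text{warmup\_steps} = 4000$:

$$
\begin{aligned}
\text{lr}_{\text{peak}} &= d_{\text{model}}^{-0.5} \cdot \text{warmup\_steps}^{-0.5} = \frac{1}{\sqrt{d_{\text{model}} \cdot \text{warmup\_steps}}} \\
&= \frac{1}{\sqrt{512 \cdot 4000}} \approx 6.99 \times 10^{-4}
\end{aligned}
$$

Because the peak scales as $d_{\text{model}}^{-0.5}$, quadrupling the model dimension halves the peak learning rate.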
Examples
>>> import lucid.optim as optim
>>> optimizer = optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98))
>>> scheduler = optim.NoamScheduler(optimizer, d_model=512, warmup_steps=4000)
>>> for step, batch in enumerate(dataloader):
...     train_step(batch)
...     optimizer.step()
...     scheduler.step()

Methods (2)
dunder __init__ → None

__init__(optimizer: Optimizer, d_model: int, warmup_steps: int, last_epoch: int = -1, verbose: bool = False)

Initialise the NoamScheduler. See the class docstring for parameter semantics.
fn get_lr → list[float]

get_lr()

Compute the learning rate for each parameter group at the current step.

Returns
list[float]: One learning rate per param group, derived from the schedule formula documented in the class docstring.
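The shape of the return value can be illustrated with plain dictionaries standing in for param groups. This is a sketch under stated assumptions, not the library's implementation: `noam_get_lr` is a hypothetical free function, the step is passed in explicitly, and the clamp to step 1 is added only to avoid a division by zero.

```python
def noam_get_lr(param_groups: list[dict], d_model: int,
                warmup_steps: int, step: int) -> list[float]:
    """Return one Noam learning rate per param group (hypothetical stand-in)."""
    step = max(step, 1)  # assumption: avoid division by zero at step 0
    lr = d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
    # Every group receives the same value; each group's own "lr" entry is
    # ignored, matching the note that the base learning rate is not a factor.
    return [lr for _ in param_groups]


# Two hypothetical param groups with different base learning rates.
groups = [{"lr": 1.0}, {"lr": 0.1}]
lrs = noam_get_lr(groups, d_model=512, warmup_steps=4000, step=4000)
```

Both entries of `lrs` are identical here, since the schedule depends only on `d_model`, `warmup_steps`, and the step count.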