lr_scheduler.NoamScheduler

class lucid.optim.lr_scheduler.NoamScheduler(optimizer: Optimizer, model_size: int, warmup_steps: int, factor: float = 1.0, last_epoch: int = -1, verbose: bool = False)

The NoamScheduler implements the warmup + inverse square-root decay strategy popularized by the Transformer. It linearly increases the learning rate during the warmup window and then decays it proportionally to \(1 / \sqrt{t}\).

Class Signature

class NoamScheduler(
    optimizer: Optimizer,
    model_size: int,
    warmup_steps: int,
    factor: float = 1.0,
    last_epoch: int = -1,
    verbose: bool = False,
)

Parameters

  • optimizer (Optimizer): The optimizer whose learning rate is controlled.

  • model_size (int): Typically the Transformer hidden dimension (d_model); the schedule is scaled by model_size^{-0.5}.

  • warmup_steps (int): Number of steps over which to linearly ramp up the learning rate.

  • factor (float, optional): Global scaling factor for the learning rate curve. Default: 1.0.

  • last_epoch (int, optional): Index of the last epoch when resuming training. Default: -1.

  • verbose (bool, optional): If True, logs learning rate changes every step. Default: False.

Mathematical Formula

The learning rate at step \(t\) is:

\[\eta_t = \text{factor} \cdot \text{model\_size}^{-0.5} \cdot \min(t^{-0.5}, \ t \cdot \text{warmup\_steps}^{-1.5})\]

Where:

  • \(t\) is the current step (1-indexed).

  • \(\eta_t\) is the learning rate at step \(t\).
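
To make the shape of the curve concrete, the snippet below evaluates the formula directly in plain Python (the helper noam_lr is purely illustrative and not part of lucid). The two arguments of the min intersect at \(t = \text{warmup\_steps}\), which is where the learning rate peaks; with the values used in the usage example below (model_size=512, warmup_steps=4000, factor=2.0) the peak is roughly 1.4e-3.

def noam_lr(step: int, model_size: int, warmup_steps: int, factor: float = 1.0) -> float:
    """Closed-form Noam learning rate at a 1-indexed step (illustrative helper)."""
    return factor * model_size**-0.5 * min(step**-0.5, step * warmup_steps**-1.5)

print(noam_lr(1, 512, 4000, factor=2.0))      # warmup start: ~3.5e-7
print(noam_lr(4000, 512, 4000, factor=2.0))   # peak at t = warmup_steps: ~1.4e-3
print(noam_lr(10000, 512, 4000, factor=2.0))  # decay phase: ~8.8e-4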

Methods

  • get_lr() -> list[float]: Returns the scaled learning rates for each optimizer parameter group.

  • step(epoch: Optional[int] = None) -> None: Advances the scheduler, updating optimizer learning rates.

Usage Example

import lucid.optim as optim
from lucid.optim.lr_scheduler import NoamScheduler

# `model` is assumed to be an already-constructed network exposing parameters().
optimizer = optim.Adam(model.parameters(), lr=1.0)  # lr=1.0 lets the Noam schedule set the effective rate
scheduler = NoamScheduler(
    optimizer,
    model_size=512,
    warmup_steps=4000,
    factor=2.0,
)

for step in range(1, 10001):
    optimizer.step()
    scheduler.step()  # advance the Noam schedule once per optimizer update
    if step % 1000 == 0:
        print(f"Step {step}, Learning Rate: {scheduler.last_lr}")

Note

Noam scheduling is effective for Transformer-style architectures, where warmup stabilizes early training of large-dimension models before the inverse square-root decay takes over.