class SGD (extends Optimizer)

SGD(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float, momentum: float = 0, dampening: float = 0, weight_decay: float = 0, nesterov: bool = False)

Stochastic Gradient Descent optimizer with optional momentum and weight decay.

Implements the classic SGD update rule. Without momentum the update is:

$$\theta_{t+1} = \theta_t - \alpha \, \nabla L(\theta_t)$$
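The plain update can be checked by hand on a toy problem. The following is a minimal pure-Python sketch, not the library's implementation; `sgd_step` and the quadratic loss are illustrative assumptions.

```python
def sgd_step(theta: float, grad: float, lr: float) -> float:
    """One plain SGD update for a single scalar parameter (hypothetical helper)."""
    return theta - lr * grad


# Toy loss L(theta) = 0.5 * theta**2, whose gradient is simply theta.
theta = 1.0
for _ in range(3):
    theta = sgd_step(theta, grad=theta, lr=0.1)

# Each step multiplies theta by (1 - lr) = 0.9, so theta is close to 0.9**3.
print(theta)
```

On this loss every step shrinks the parameter by the same factor, which makes the effect of the learning rate easy to see.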

With momentum (Polyak momentum), a velocity buffer $v$ is maintained and the update becomes:

$$\begin{aligned} v_{t+1} &= \mu \, v_t + (1 - \tau) \, \nabla L(\theta_t) \\ \theta_{t+1} &= \theta_t - \alpha \, v_{t+1} \end{aligned}$$
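As a worked illustration of the momentum update, here is a small pure-Python sketch. It assumes the velocity buffer starts at zero, which the text above does not specify, and `momentum_step` is a hypothetical helper rather than part of the library.

```python
def momentum_step(theta, v, grad, lr, momentum=0.0, dampening=0.0):
    """Apply one momentum update to a scalar parameter (hypothetical helper)."""
    v_new = momentum * v + (1.0 - dampening) * grad  # update the velocity buffer
    theta_new = theta - lr * v_new                   # move along the velocity
    return theta_new, v_new


# Toy loss L(theta) = 0.5 * theta**2, so the gradient equals theta.
theta, v = 1.0, 0.0  # assumption: the buffer is initialised to zero
for _ in range(2):
    theta, v = momentum_step(theta, v, grad=theta, lr=0.1, momentum=0.9)

# Step 1: v = 1.0, theta = 0.9.  Step 2: v = 0.9 * 1.0 + 0.9 = 1.8, theta = 0.72.
print(theta, v)
```

The second step is larger than the first even though the gradient has shrunk, because the velocity accumulates past gradients.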

where $\mu$ is the momentum factor and $\tau$ is the dampening coefficient. With Nesterov momentum, the step adds a lookahead correction by combining the current gradient with the updated velocity:

$$\theta_{t+1} = \theta_t - \alpha \bigl(\nabla L(\theta_t) + \mu \, v_{t+1}\bigr)$$
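The Nesterov formula can be traced for a single step with a short pure-Python sketch. It assumes a zero initial velocity and zero dampening; `nesterov_step` is a hypothetical name used only for illustration.

```python
def nesterov_step(theta, v, grad, lr, momentum):
    """One Nesterov-style update for a scalar parameter (hypothetical helper)."""
    v_new = momentum * v + grad                          # dampening is 0 here
    theta_new = theta - lr * (grad + momentum * v_new)   # gradient plus lookahead term
    return theta_new, v_new


# Toy loss L(theta) = 0.5 * theta**2, so the gradient at theta = 1.0 is 1.0.
theta, v = nesterov_step(theta=1.0, v=0.0, grad=1.0, lr=0.1, momentum=0.9)

# v = 1.0 and theta = 1.0 - 0.1 * (1.0 + 0.9 * 1.0), which is close to 0.81.
print(theta, v)
```

For comparison, the non-Nesterov momentum rule with the same inputs would move the parameter only to 0.9 on this first step.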

L2 weight decay adds $\lambda \theta_t$ to the gradient before the momentum step, and the resulting $g_t$ replaces $\nabla L(\theta_t)$ in the updates above:

$$g_t = \nabla L(\theta_t) + \lambda \, \theta_t$$
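A short pure-Python sketch of the weight-decay term follows. The helper name `decayed_gradient` and the numbers are illustrative assumptions; the example uses a plain step without momentum to keep the arithmetic visible.

```python
def decayed_gradient(grad: float, theta: float, weight_decay: float) -> float:
    """Fold the L2 penalty into the gradient: g = grad + weight_decay * theta."""
    return grad + weight_decay * theta


theta, grad = 2.0, 0.5
g = decayed_gradient(grad, theta, weight_decay=0.1)  # 0.5 + 0.1 * 2.0
theta_next = theta - 0.1 * g                         # plain step with lr = 0.1

print(g, theta_next)
```

Because the penalty is proportional to the parameter itself, larger weights receive a stronger pull toward zero.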

Parameters

params: iterable of Parameter or iterable of dict
Parameters to optimise, or a list of parameter-group dicts.
lr: float
Learning rate $\alpha$.
momentum: float = 0
Momentum factor $\mu$ (default: 0). Set to a value such as 0.9 to enable momentum.
dampening: float = 0
Dampening factor $\tau$ for the momentum buffer (default: 0). Has no effect when momentum=0.
weight_decay: float = 0
L2 regularisation coefficient $\lambda$ (default: 0).
nesterov: bool = False
If True, use Nesterov momentum (default: False). Requires momentum > 0 and dampening == 0.
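The constraint on `nesterov` can be expressed as a small check. This is a sketch only: `check_sgd_hyperparams` is a hypothetical function, and raising `ValueError` is an assumption about how such a violation might be reported, not documented library behaviour.

```python
def check_sgd_hyperparams(momentum: float, dampening: float, nesterov: bool) -> None:
    """Reject nesterov=True unless momentum > 0 and dampening == 0 (hypothetical)."""
    if nesterov and (momentum <= 0 or dampening != 0):
        raise ValueError("nesterov=True requires momentum > 0 and dampening == 0")


# A valid combination passes silently.
check_sgd_hyperparams(momentum=0.9, dampening=0.0, nesterov=True)

# Nesterov without momentum violates the stated requirement.
try:
    check_sgd_hyperparams(momentum=0.0, dampening=0.0, nesterov=True)
    rejected = False
except ValueError:
    rejected = True

print(rejected)
```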

Attributes

param_groups: list of dict
Parameter groups, each containing "params", "lr", "momentum", "dampening", "weight_decay", and "nesterov".
defaults: dict
Default hyperparameter values.

Notes

SGD with momentum is a widely used choice for training image classifiers. Nesterov momentum can converge faster than classical momentum because each step includes a correction based on where the parameters will be after the momentum step.

Examples

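>>> import lucid.optim as optim
>>> # `model` is assumed to be an existing module with trainable parameters.
>>> optimizer = optim.SGD(
...     model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4
... )
>>> optimizer.zero_grad()  # clear gradients from the previous iteration
>>> # `loss` is assumed to be a scalar tensor computed from the model's output.
>>> loss.backward()        # populate gradients
>>> optimizer.step()       # apply the SGD update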

Methods (2)

__init__

__init__(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float, momentum: float = 0, dampening: float = 0, weight_decay: float = 0, nesterov: bool = False) -> None

Initialise the SGD optimizer. See the class docstring for parameter semantics.

step

step(closure: _OptimizerClosure = None) -> Tensor | None

Perform a single SGD step.
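The signature suggests the common pattern in which an optional closure re-evaluates the loss and `step` returns that value. The sketch below illustrates that pattern in pure Python under this assumption; `ToySGD` is a hypothetical stand-in and does not reproduce the library class or its closure type.

```python
from typing import Callable, Optional


class ToySGD:
    """Hypothetical stand-in that updates dicts holding "value" and "grad"."""

    def __init__(self, params: list, lr: float) -> None:
        self.params = params
        self.lr = lr

    def step(self, closure: Optional[Callable[[], float]] = None) -> Optional[float]:
        # Assumed semantics: the closure recomputes the loss and the gradients.
        loss = closure() if closure is not None else None
        for p in self.params:
            p["value"] -= self.lr * p["grad"]
        return loss


p = {"value": 1.0, "grad": 0.0}


def closure() -> float:
    p["grad"] = p["value"]          # gradient of 0.5 * value**2
    return 0.5 * p["value"] ** 2    # loss evaluated before the update


opt = ToySGD([p], lr=0.1)
loss = opt.step(closure)            # loss is 0.5, value moves from 1.0 to 0.9
no_loss = opt.step()                # no closure, so nothing is returned

print(loss, no_loss, p["value"])
```

In this sketch a call without a closure reuses whatever gradient is already stored, which mirrors the usual `backward()` then `step()` workflow.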