class SGD extends Optimizer

SGD(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float, momentum: float = 0, dampening: float = 0, weight_decay: float = 0, nesterov: bool = False)

Stochastic Gradient Descent optimizer with optional momentum and weight decay.

Implements the classic SGD update rule. Without momentum the update is:

$$\theta_{t+1} = \theta_t - \alpha \, \nabla L(\theta_t)$$

With momentum (Polyak momentum), a velocity buffer $v$ is maintained and the update becomes:

$$\begin{aligned} v_{t+1} &= \mu \, v_t + (1 - \tau) \, \nabla L(\theta_t) \\ \theta_{t+1} &= \theta_t - \alpha \, v_{t+1} \end{aligned}$$

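where $\mu$ is the momentum factor and $\tau$ is the dampening coefficient. With Nesterov momentum the step adds a lookahead correction: the velocity is updated as above, and the parameter update combines the current gradient with the freshly updated velocity $v_{t+1}$:

$$\theta_{t+1} = \theta_t - \alpha \bigl(\nabla L(\theta_t) + \mu \, v_{t+1}\bigr)$$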

L2 weight decay adds $\lambda \theta_t$ to the gradient before the momentum step, so $g_t$ takes the place of $\nabla L(\theta_t)$ in the updates above:

$$g_t = \nabla L(\theta_t) + \lambda \, \theta_t$$
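The update rules above can be sketched for a single scalar parameter in plain Python. This is an illustrative sketch, not the library's implementation: `sgd_step` is a hypothetical helper, and it assumes the velocity buffer starts at zero and is passed in explicitly, which the text above does not specify.

```python
def sgd_step(theta, grad, velocity, lr, momentum=0.0, dampening=0.0,
             weight_decay=0.0, nesterov=False):
    """Return (new_theta, new_velocity) for one scalar SGD update.

    Hypothetical helper that mirrors the documented formulas only.
    """
    # L2 weight decay: g_t = grad + lambda * theta
    g = grad + weight_decay * theta

    if momentum == 0.0:
        # Plain SGD: theta <- theta - alpha * g (velocity is unused)
        return theta - lr * g, velocity

    # Velocity update: v <- mu * v + (1 - tau) * g
    velocity = momentum * velocity + (1.0 - dampening) * g

    # Nesterov steps along g + mu * v; classical momentum steps along v
    direction = g + momentum * velocity if nesterov else velocity
    return theta - lr * direction, velocity


# One step from theta = 1.0 with gradient 2.0 and an assumed zero initial velocity
theta, v = sgd_step(1.0, grad=2.0, velocity=0.0, lr=0.1, momentum=0.9)
```

With these inputs the velocity becomes 2.0 and the parameter moves from 1.0 to 0.8. Setting `nesterov=True` on the same inputs steps along 2.0 + 0.9 * 2.0 = 3.8 instead, giving 0.62.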

Parameters

params : iterable of Parameter or iterable of dict

Parameters to optimise, or a list of parameter-group dicts.

lr : float

Learning rate $\alpha$.

momentum : float = 0

Momentum factor $\mu$ (default: 0). Set to a value such as 0.9 to enable momentum.

dampening : float = 0

Dampening factor $\tau$ for the momentum buffer (default: 0). Has no effect when momentum=0.

weight_decay : float = 0

L2 regularisation coefficient $\lambda$ (default: 0).

nesterov : bool = False

If True, use Nesterov momentum (default: False). Requires momentum > 0 and dampening == 0.
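The constraint on `nesterov` can be expressed as a small check. The sketch below is a hypothetical helper (`check_sgd_hparams` is not part of the documented API), and the exception type and message are assumptions; the library may report the error differently.

```python
def check_sgd_hparams(momentum=0.0, dampening=0.0, nesterov=False):
    """Validate the documented Nesterov constraint (hypothetical helper)."""
    # nesterov=True is only meaningful with momentum > 0 and dampening == 0
    if nesterov and (momentum <= 0 or dampening != 0):
        raise ValueError(
            "nesterov=True requires momentum > 0 and dampening == 0"
        )
    return True


check_sgd_hparams(momentum=0.9, nesterov=True)   # passes
# check_sgd_hparams(momentum=0.0, nesterov=True) would raise ValueError
```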

Attributes

param_groups : list of dict

Parameter groups, each containing "params", "lr", "momentum", "dampening", "weight_decay", and "nesterov".

defaults : dict

Default hyperparameter values.
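One way to picture the relationship between `defaults` and `param_groups` is the sketch below. It assumes the common convention that each group starts from the defaults and that any key set in the group dict overrides them; the page does not confirm how the library merges them. `build_param_groups` is a hypothetical helper, and strings stand in for `Parameter` objects.

```python
def build_param_groups(params, defaults):
    """Normalise `params` into group dicts, filling gaps from `defaults`.

    Hypothetical helper; assumes per-group values override the defaults.
    """
    params = list(params)
    # A flat iterable of parameters becomes a single group
    if not params or not isinstance(params[0], dict):
        params = [{"params": params}]

    groups = []
    for group in params:
        merged = dict(defaults)                   # optimizer-wide defaults
        merged.update(group)                      # per-group overrides
        merged["params"] = list(group["params"])
        groups.append(merged)
    return groups


defaults = {"lr": 0.01, "momentum": 0.9, "dampening": 0.0,
            "weight_decay": 1e-4, "nesterov": False}

# Second group disables weight decay (for example, for bias terms)
groups = build_param_groups(
    [{"params": ["w1", "w2"]}, {"params": ["b"], "weight_decay": 0.0}],
    defaults,
)
```

Here `groups[0]` keeps every default, while `groups[1]` has `weight_decay` set to 0.0 and inherits the remaining keys.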

Notes

SGD with momentum is a widely used default for training image classifiers. Nesterov momentum often converges faster than classical momentum because its update includes a correction based on where the momentum step is about to carry the parameters.
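As a toy illustration of the difference between the two momentum variants, the sketch below minimises the one-dimensional quadratic $L(\theta) = \tfrac{1}{2}\theta^2$ with both update rules. The function `run`, the starting point, and the hyperparameters are assumptions chosen for the example, with zero dampening, no weight decay, and a velocity buffer starting at zero; it says nothing about behaviour on real networks.

```python
def run(nesterov, steps=200, lr=0.1, momentum=0.9):
    """Minimise 0.5 * theta**2 from theta = 1.0; return |theta| at the end."""
    theta, v = 1.0, 0.0
    for _ in range(steps):
        g = theta                      # gradient of 0.5 * theta**2
        v = momentum * v + g           # velocity update (dampening = 0)
        # Nesterov steps along g + mu * v; classical momentum steps along v
        theta -= lr * ((g + momentum * v) if nesterov else v)
    return abs(theta)


classical = run(nesterov=False)
lookahead = run(nesterov=True)
```

For this particular quadratic and these settings both iterations are linear, with a per-step contraction factor of about 0.949 for classical momentum and 0.9 for the Nesterov form. Other losses and hyperparameters can behave differently.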

Examples

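>>> import lucid.optim as optim
>>> # Assumes `model` (a module exposing parameters()) and `loss` (a scalar
>>> # tensor computed from the model's output) are already defined.
>>> optimizer = optim.SGD(
...     model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4
... )
>>> optimizer.zero_grad()
>>> loss.backward()
>>> optimizer.step()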

Methods (2)

__init__ → None

__init__(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float, momentum: float = 0, dampening: float = 0, weight_decay: float = 0, nesterov: bool = False)

Initialise the SGD optimizer. See the class docstring for parameter semantics.

step → Tensor | None

step(closure: _OptimizerClosure = None)

Perform a single SGD step.