class ASGD extends Optimizer

ASGD(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float = 0.01, lambd: float = 0.0001, alpha: float = 0.75, t0: float = 1000000.0, weight_decay: float = 0)

Averaged Stochastic Gradient Descent optimizer.

ASGD performs standard SGD updates but maintains a running average of the iterate sequence, which serves as the final parameter estimate. The averaging improves convergence in the presence of noise and is particularly effective near the end of training.

The SGD update, with the decay term \lambda acting as an L2 penalty on the parameters, is:

\theta_t = \theta_{t-1} - \eta_t \bigl(g_t + \lambda \, \theta_{t-1}\bigr)

where the effective learning rate decays as:

\eta_t = \frac{\eta_0}{(1 + \lambda \, \eta_0 \, t)^\alpha}
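As a quick illustration of the decay schedule above, here is a minimal sketch in plain Python. The helper name `asgd_lr` is hypothetical and not part of the library; it only evaluates the formula for \eta_t using the class defaults.

```python
def asgd_lr(lr: float, lambd: float, alpha: float, t: int) -> float:
    """Effective learning rate eta_t = lr / (1 + lambd * lr * t) ** alpha."""
    return lr / (1.0 + lambd * lr * t) ** alpha


# Class defaults: lr=0.01, lambd=1e-4, alpha=0.75.
for t in (0, 10_000, 1_000_000):
    print(t, asgd_lr(0.01, 1e-4, 0.75, t))
```

With the defaults, \lambda \eta_0 = 1e-6, so the decay is slow: after one million steps the denominator is only 2^0.75, roughly 1.68.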

The Polyak–Ruppert average is then:

\bar{\theta}_t = \frac{1}{t - t_0 + 1} \sum_{k=t_0}^{t} \theta_k \quad \text{for } t \ge t_0
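The three formulas can be combined into a small reference loop. The following is a sketch under stated assumptions, not the library's implementation: `asgd_scalar` is a hypothetical function that optimises a single scalar, and it keeps the average as an incremental mean of the iterates from step t_0 onward, simply tracking the current iterate before that.

```python
import random


def asgd_scalar(grad_fn, theta0, steps, lr=0.01, lambd=1e-4, alpha=0.75, t0=0):
    """Run ASGD on one scalar parameter; return (last iterate, averaged iterate)."""
    theta = theta0
    avg = theta0
    for t in range(1, steps + 1):
        eta = lr / (1.0 + lambd * lr * t) ** alpha  # decayed learning rate
        g = grad_fn(theta)
        theta = theta - eta * (g + lambd * theta)   # SGD step with decay term
        if t >= t0:
            n = t - t0 + 1                          # number of iterates in the mean
            avg += (theta - avg) / n                # incremental mean update
        else:
            avg = theta                             # before t0: track the iterate
    return theta, avg


def noisy_grad(theta):
    """Gradient of 0.5 * (theta - 3)^2 plus unit Gaussian noise (toy problem)."""
    return (theta - 3.0) + random.gauss(0.0, 1.0)


random.seed(0)
last, avg = asgd_scalar(noisy_grad, theta0=0.0, steps=5000, lr=0.05, t0=1000)
print(f"last iterate: {last:.3f}, averaged: {avg:.3f}")
```

On a noisy problem like this toy quadratic, the averaged value typically fluctuates less around the minimiser than the last iterate does, which is the motivation for returning the average as the final estimate.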

Parameters

params: iterable of Parameter or iterable of dict

Parameters to optimise, or a list of parameter-group dicts.

lr: float = 0.01

Initial learning rate \eta_0 (default: 1e-2).

lambd: float = 0.0001

Decay term \lambda (default: 1e-4).

alpha: float = 0.75

Power for LR decay \alpha (default: 0.75).

t0: float = 1000000.0

Step at which averaging begins (default: 1e6).

weight_decay: float = 0

L2 regularisation coefficient (default: 0).

Attributes

param_groups: list of dict

Parameter groups with keys "params", "lr", "lambd", "alpha", "t0", and "weight_decay".

defaults: dict

Default hyperparameter values.

Notes

ASGD can match or exceed the convergence rate of SGD with careful learning-rate tuning, and the averaging step provides additional regularisation. With the default t0=1e6, averaging does not begin until step 1,000,000, so a run shorter than that never averages; reduce t0 to start averaging earlier in training.
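One way to choose a smaller t0 is to tie it to the length of the run. The sketch below is only an illustration: `pick_t0` is a hypothetical helper, and starting the average at 75% of training is an assumed heuristic rather than a recommendation from the library.

```python
def pick_t0(steps_per_epoch: int, epochs: int, start_fraction: float = 0.75) -> int:
    """Hypothetical helper: step index at which averaging should begin."""
    if not 0.0 <= start_fraction <= 1.0:
        raise ValueError("start_fraction must be in [0, 1]")
    total_steps = steps_per_epoch * epochs
    return int(start_fraction * total_steps)


t0 = pick_t0(steps_per_epoch=500, epochs=20)  # 10_000 optimizer steps in total
print(t0)  # 7500
# The result would then be passed as the t0 argument, e.g.
# optim.ASGD(model.parameters(), lr=1e-2, t0=t0)
```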

Examples

>>> import lucid.optim as optim
>>> optimizer = optim.ASGD(model.parameters(), lr=1e-2, t0=1e5)
>>> optimizer.zero_grad()
>>> loss.backward()
>>> optimizer.step()

Methods (2)

__init__(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float = 0.01, lambd: float = 0.0001, alpha: float = 0.75, t0: float = 1000000.0, weight_decay: float = 0) → None

Initialise the ASGD optimizer. See the class docstring for parameter semantics.

step(closure: _OptimizerClosure = None) → Tensor | None

Perform a single ASGD step.