ASGD
Optimizer
ASGD(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float = 0.01, lambd: float = 0.0001, alpha: float = 0.75, t0: float = 1000000.0, weight_decay: float = 0)
Averaged Stochastic Gradient Descent optimizer.
ASGD performs standard SGD updates but maintains a running average of the iterate sequence, which serves as the final parameter estimate. The averaging improves convergence in the presence of noise and is particularly effective near the end of training.
The SGD update with L2 regularisation is:
where the effective learning rate decays as:
The Polyak–Ruppert average is then:
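For orientation, the three quantities named above can be written in the form used by the widely known ASGD formulation (as in PyTorch's `ASGD`). This is a sketch under that assumption and has not been checked against Lucid's implementation, which may differ in detail. The symbols are labels chosen here: $g_t$ is the gradient at step $t$, $\eta_0$ is `lr`, $\lambda$ is `lambd`, $\alpha$ is `alpha`, $w$ is `weight_decay`, and $\bar{\theta}$ is the averaged parameter.

```latex
% Assumed formulation (PyTorch-style ASGD), not confirmed for Lucid.
\begin{align}
  \theta_{t+1} &= (1 - \lambda\,\eta_t)\,\theta_t - \eta_t\,\bigl(g_t + w\,\theta_t\bigr) \\
  \eta_t &= \frac{\eta_0}{(1 + \lambda\,\eta_0\,t)^{\alpha}} \\
  \bar{\theta}_{t+1} &= \bar{\theta}_t + \mu_t\,\bigl(\theta_{t+1} - \bar{\theta}_t\bigr),
  \qquad \mu_t = \frac{1}{\max(1,\; t - t_0)}
\end{align}
```

Under this form, $\mu_t = 1$ while $t \le t_0 + 1$, so the average simply tracks the current iterate until step `t0` is passed.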
Parameters
params : iterable of Parameter or iterable of dict
lr : float = 0.01
lambd : float = 0.0001
alpha : float = 0.75
t0 : float = 1000000.0 (1e6)
weight_decay : float = 0
Attributes
param_groups : list of dict
Each dict holds the keys "params", "lr", "lambd", "alpha", "t0", and "weight_decay".
defaults : dict
Notes
ASGD can match or exceed the convergence rate of SGD with careful
learning-rate tuning, and the averaging step provides additional
regularisation. The default t0=1e6 delays averaging until very
late in training; reduce it to start averaging earlier.
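The effect of `t0` can be illustrated with a minimal scalar sketch in plain Python. It does not use Lucid. The function `asgd_scalar` is a hypothetical helper, and its update rule is assumed from the common PyTorch-style ASGD formulation, so Lucid's actual behaviour may differ. It minimises f(θ) = θ²/2 (gradient θ) with a late and an early averaging start.

```python
def asgd_scalar(grad_fn, theta0, steps, lr=0.01, lambd=1e-4, alpha=0.75,
                t0=1e6, weight_decay=0.0):
    """Hypothetical scalar ASGD loop; returns (last iterate, averaged iterate).

    Assumed update rule (PyTorch-style ASGD), not taken from Lucid's source.
    """
    theta = theta0
    avg = theta0
    eta = lr   # effective learning rate for the current step
    mu = 1.0   # averaging weight; 1.0 means the average just copies theta
    for t in range(1, steps + 1):
        g = grad_fn(theta) + weight_decay * theta       # add the L2 term
        theta = theta * (1.0 - lambd * eta) - eta * g   # SGD step with decay
        if mu == 1.0:
            avg = theta                    # averaging has not started yet
        else:
            avg += mu * (theta - avg)      # running mean of later iterates
        eta = lr / (1.0 + lambd * lr * t) ** alpha      # decayed learning rate
        mu = 1.0 / max(1.0, t - t0)                     # averaging weight
    return theta, avg


def grad(theta):
    # Gradient of f(theta) = theta**2 / 2
    return theta


# Default t0=1e6: 50 steps never reach t0, so the average equals the iterate.
late_last, late_avg = asgd_scalar(grad, theta0=1.0, steps=50, lr=0.1, t0=1e6)

# t0=0: averaging starts almost immediately and lags the shrinking iterate.
early_last, early_avg = asgd_scalar(grad, theta0=1.0, steps=50, lr=0.1, t0=0)

print(late_last, late_avg)
print(early_last, early_avg)
```

In this sketch the iterate trajectory is identical in both runs; only the averaged value changes, which is why lowering `t0` matters when training is much shorter than the default of 1e6 steps.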
Examples
>>> import lucid.optim as optim
>>> optimizer = optim.ASGD(model.parameters(), lr=1e-2, t0=1e5)
>>> optimizer.zero_grad()
>>> loss.backward()
>>> optimizer.step()
Methods (2)
__init__
__init__(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float = 0.01, lambd: float = 0.0001, alpha: float = 0.75, t0: float = 1000000.0, weight_decay: float = 0) → None
Initialise the ASGD optimizer. See the class docstring for parameter semantics.
step
step(closure: _OptimizerClosure = None) → Tensor | None
Perform a single ASGD step.