class

Adagrad

extends Optimizer
Adagrad(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float = 0.01, lr_decay: float = 0, weight_decay: float = 0, eps: float = 1e-10)

Adaptive Gradient optimizer.

Adagrad adapts the learning rate for each parameter based on the sum of all past squared gradients. Parameters that receive large or frequent gradient updates get smaller effective learning rates:

$$
\begin{aligned}
G_t &= G_{t-1} + g_t^2 \\
\theta_t &= \theta_{t-1} - \frac{\eta_t}{\sqrt{G_t} + \epsilon} \, g_t
\end{aligned}
$$

where the effective learning rate decays with time as:

$$
\eta_t = \frac{\eta_0}{1 + (t - 1) \cdot \lambda}
$$

and $\lambda$ is the lr_decay parameter.
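The update rule above can be sketched in plain Python. This is only an illustration of the formulas, not lucid's implementation: `adagrad_step` is a hypothetical helper, parameters are modelled as a list of floats, and `step` is assumed to be the 1-based step count $t$.

```python
import math


def adagrad_step(theta, grad, state_sum, step, lr=0.01, lr_decay=0.0, eps=1e-10):
    """Apply one Adagrad update to a list of scalar parameters.

    Hypothetical helper for illustration only; not part of lucid.
    `step` is the 1-based step count t.
    """
    # eta_t = eta_0 / (1 + (t - 1) * lambda)
    eta_t = lr / (1.0 + (step - 1) * lr_decay)

    new_theta, new_sum = [], []
    for p, g, s in zip(theta, grad, state_sum):
        s = s + g * g                             # G_t = G_{t-1} + g_t^2
        p = p - eta_t / (math.sqrt(s) + eps) * g  # theta_t update
        new_theta.append(p)
        new_sum.append(s)
    return new_theta, new_sum


theta = [1.0, -2.0]
state_sum = [0.0, 0.0]   # G_0 = 0 for every parameter
grad = [0.5, -4.0]

theta, state_sum = adagrad_step(theta, grad, state_sum, step=1, lr=0.1)
```

On the first step $\sqrt{G_1} = |g_1|$, so each parameter moves by approximately `lr` in the direction opposite its gradient, regardless of the gradient's magnitude. Here `theta` ends up close to `[0.9, -1.9]`.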

Parameters

params : iterable of Parameter or iterable of dict
Parameters to optimise, or a list of parameter-group dicts.
lr : float = 0.01
Initial learning rate $\eta_0$ (default: 1e-2).
lr_decay : float = 0
Learning-rate decay $\lambda$ applied to the effective learning rate at each step (default: 0).
weight_decay : float = 0
L2 regularisation coefficient (default: 0).
eps : float = 1e-10
Term $\epsilon$ added to the denominator for numerical stability (default: 1e-10).
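As a rough illustration of what an L2 coefficient usually means for a gradient-based optimizer: a penalty of `(weight_decay / 2) * ||theta||^2` on the loss contributes `weight_decay * theta` to the gradient. The sketch below assumes that convention, which is the one used by PyTorch's Adagrad; whether lucid applies weight_decay in exactly this way is an assumption, and `apply_weight_decay` is a hypothetical helper.

```python
def apply_weight_decay(theta, grad, weight_decay=0.0):
    """Return the gradients with the L2 penalty term added.

    Hypothetical helper for illustration only; not part of lucid.
    Assumes the penalty (weight_decay / 2) * ||theta||^2, whose gradient
    with respect to theta is weight_decay * theta.
    """
    return [g + weight_decay * p for p, g in zip(theta, grad)]


theta = [2.0, -1.0]
grad = [0.5, 0.5]

# The adjusted gradient is what would then feed the squared-gradient sum.
decayed = apply_weight_decay(theta, grad, weight_decay=0.1)
```

With these values the adjusted gradients are approximately `[0.7, 0.4]`: the penalty pushes each parameter towards zero in proportion to its current value.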

Attributes

param_groups : list of dict
Parameter groups with keys "params", "lr", "lr_decay", "weight_decay", and "eps".
defaults : dict
Default hyperparameter values.

Notes

Adagrad is particularly well-suited for sparse data (e.g. NLP with word embeddings) because infrequently updated parameters retain a larger effective learning rate. The main drawback is that the accumulated squared-gradient sum $G_t$ only grows, so the effective learning rate can become vanishingly small over long training runs.
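Both effects can be seen in a small numerical sketch. It tracks the per-parameter step size `lr / (sqrt(G_t) + eps)` for a parameter that receives a gradient on every step and for one that receives a gradient only occasionally. `effective_lr` is a hypothetical helper, and lr_decay is assumed to be 0.

```python
import math


def effective_lr(grad_history, lr=0.01, eps=1e-10):
    """Return lr / (sqrt(sum of squared gradients) + eps) after each step.

    Hypothetical helper for illustration only; not part of lucid.
    """
    total = 0.0
    rates = []
    for g in grad_history:
        total += g * g  # the accumulated sum never decreases
        rates.append(lr / (math.sqrt(total) + eps))
    return rates


# A "dense" parameter sees a gradient of 1.0 on every one of 100 steps.
dense = effective_lr([1.0] * 100)

# A "sparse" parameter sees the same gradient on only 4 of the 100 steps.
sparse = effective_lr([1.0 if t % 25 == 0 else 0.0 for t in range(100)])
```

After 100 steps the dense parameter's step size has fallen from about 0.01 to about 0.001, while the sparse parameter's is still about 0.005. Neither value can ever increase again, which is the long-run drawback described above.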

Examples

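>>> import lucid.optim as optim
>>> # `model` and `loss` are assumed to be defined elsewhere.
>>> optimizer = optim.Adagrad(model.parameters(), lr=1e-2)
>>> optimizer.zero_grad()  # clear gradients from the previous step
>>> loss.backward()        # compute gradients for the current loss
>>> optimizer.step()       # apply the Adagrad update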

Methods (2)

dunder

__init__

None
__init__(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float = 0.01, lr_decay: float = 0, weight_decay: float = 0, eps: float = 1e-10)

Initialise the Adagrad optimizer. See the class docstring for parameter semantics.

fn

step

Tensor | None
step(closure: _OptimizerClosure = None)

Perform a single Adagrad step.