class Adagrad extends Optimizer
Adagrad(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float = 0.01, lr_decay: float = 0, weight_decay: float = 0, eps: float = 1e-10)
Adaptive Gradient optimizer.
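Adagrad adapts the learning rate for each parameter based on the sum of all past squared gradients. Parameters that receive large or frequent gradient updates get smaller effective learning rates:

$$G_t = G_{t-1} + g_t^2, \qquad \theta_t = \theta_{t-1} - \frac{\eta_t}{\sqrt{G_t} + \epsilon}\, g_t$$

Here $g_t$ is the gradient at step $t$, $G_t$ is the running sum of squared gradients, and $\epsilon$ is the eps parameter. The effective learning rate $\eta_t$ decays with time as:

$$\eta_t = \frac{\eta}{1 + (t - 1)\,\lambda}$$

where $\eta$ is the lr parameter and $\lambda$ is the lr_decay parameter.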
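The update described above can be sketched in plain NumPy. This is a minimal sketch, assuming the common Adagrad formulation: squared gradients are accumulated, the gradient is divided by the square root of that sum plus eps, and the base learning rate is decayed as lr / (1 + (step - 1) * lr_decay). The function `adagrad_step` and its explicit state handling are illustrative names, not part of Lucid, and Lucid's internal implementation may differ in detail.

```python
import numpy as np


def adagrad_step(param, grad, state_sum, step,
                 lr=0.01, lr_decay=0.0, weight_decay=0.0, eps=1e-10):
    """One Adagrad update on NumPy arrays (illustrative, not Lucid's code).

    Returns the updated parameter and the updated squared-gradient sum.
    `step` is assumed to count from 1.
    """
    if weight_decay != 0.0:
        # L2 regularisation folded into the gradient.
        grad = grad + weight_decay * param
    # Running sum of squared gradients; it never decreases.
    state_sum = state_sum + grad ** 2
    # Assumed decay schedule for the base learning rate.
    clr = lr / (1.0 + (step - 1) * lr_decay)
    new_param = param - clr * grad / (np.sqrt(state_sum) + eps)
    return new_param, state_sum


param = np.array([1.0, -2.0])
state_sum = np.zeros_like(param)
grad = np.array([0.5, -0.5])
for step in range(1, 4):
    param, state_sum = adagrad_step(param, grad, state_sum, step)
    print(step, param)
```

With a constant gradient, each successive step moves the parameter less, because the accumulated sum in the denominator keeps growing.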
Parameters
params: iterable of Parameter or iterable of dict
    Parameters to optimise, or a list of parameter-group dicts.
lr: float = 0.01
    Initial learning rate (default: 1e-2).
lr_decay: float = 0
    Learning-rate decay applied to the effective LR at each step (default: 0).
weight_decay: float = 0
    L2 regularisation coefficient (default: 0).
eps: float = 1e-10
    Term added to the denominator for numerical stability (default: 1e-10).
Attributes
param_groups: list of dict
    Parameter groups with keys "params", "lr", "lr_decay", "weight_decay", and "eps".
defaults: dict
    Default hyperparameter values.
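To make the relationship between `params`, `defaults`, and `param_groups` concrete, here is a minimal sketch of how parameter-group dicts are commonly filled in from defaults in PyTorch-style optimizers. The helper `build_param_groups` is hypothetical and is not a Lucid function; whether Lucid normalises groups exactly this way is an assumption. Strings stand in for Parameter objects so the snippet runs without Lucid.

```python
# Defaults taken from the documented Adagrad signature.
ADAGRAD_DEFAULTS = {"lr": 0.01, "lr_decay": 0.0, "weight_decay": 0.0, "eps": 1e-10}


def build_param_groups(params, defaults):
    """Hypothetical helper: normalise `params` into a list of group dicts."""
    params = list(params)
    if not params:
        raise ValueError("optimizer got an empty parameter list")
    # A flat iterable of parameters becomes a single group.
    if not isinstance(params[0], dict):
        params = [{"params": params}]
    groups = []
    for group in params:
        # Per-group values override the defaults; missing keys are filled in.
        merged = {**defaults, **group}
        merged["params"] = list(group["params"])
        groups.append(merged)
    return groups


groups = build_param_groups(
    [
        {"params": ["embedding.weight"], "lr": 0.1},       # custom learning rate
        {"params": ["linear.weight", "linear.bias"]},      # uses all defaults
    ],
    ADAGRAD_DEFAULTS,
)
for g in groups:
    print(g)
```

Under this assumption, the first group keeps its own "lr" of 0.1 while every other key falls back to the default value.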
Notes
Adagrad is particularly well-suited for sparse data (e.g. NLP with word embeddings) because infrequently updated parameters retain a larger effective learning rate. The main drawback is that the accumulated squared-gradient sum only grows, so the effective learning rate can become vanishingly small over long training runs.
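The effect described in this note can be illustrated numerically. The sketch below assumes the common form of the per-parameter step scale, lr / (sqrt(sum of squared gradients) + eps), with lr_decay left at 0; `effective_lr` is an illustrative helper, not a Lucid function. One parameter receives a gradient at every step, the other only once every 100 steps.

```python
import numpy as np


def effective_lr(grads, lr=0.01, eps=1e-10):
    """Per-parameter step scale after seeing all rows of `grads`.

    `grads` has shape (steps, n_params). Illustrative only.
    """
    state_sum = np.sum(np.asarray(grads) ** 2, axis=0)
    return lr / (np.sqrt(state_sum) + eps)


steps = 1000
grads = np.zeros((steps, 2))
grads[:, 0] = 1.0          # frequently updated parameter: gradient at every step
grads[99::100, 1] = 1.0    # rarely updated parameter: gradient every 100th step

scale = effective_lr(grads)
print("frequent:", scale[0])
print("rare:    ", scale[1])
print("ratio:   ", scale[1] / scale[0])
```

The rarely updated parameter ends up with a step scale about ten times larger than the frequently updated one, while the frequent parameter's scale keeps shrinking as more steps accumulate.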
Examples
>>> import lucid.optim as optim
>>> optimizer = optim.Adagrad(model.parameters(), lr=1e-2)
>>> optimizer.zero_grad()
>>> loss.backward()
>>> optimizer.step()
Methods (2)
dunder __init__
__init__(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float = 0.01, lr_decay: float = 0, weight_decay: float = 0, eps: float = 1e-10) → None
Initialise the Adagrad optimizer. See the class docstring for parameter semantics.
fn step
step(closure: _OptimizerClosure = None) → Tensor | None
Perform a single Adagrad step.