class SparseAdam

extends Optimizer
SparseAdam(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float = 0.001, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08)

Adam optimizer designed for sparse gradient workloads.

Implements a dense Adam update that is API-compatible with the sparse variant used in embedding-heavy models. All moment state is stored as dense lucid._C.engine.TensorImpl buffers and updated using engine operations, so the optimizer works correctly on both CPU and GPU.

The update rule is identical to standard Adam:

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t} \\
\hat{v}_t &= \frac{v_t}{1 - \beta_2^t} \\
\theta_t &= \theta_{t-1} - \frac{\alpha \sqrt{1 - \beta_2^t}}{1 - \beta_1^t} \cdot \frac{m_t}{\sqrt{v_t} + \epsilon}
\end{aligned}
$$
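To make the arithmetic concrete, here is a minimal sketch of one update on a single scalar parameter, written in plain Python rather than with Lucid engine ops. The function name `adam_update` and its argument layout are illustrative assumptions, not part of the Lucid API; the bias corrections are folded into the step size exactly as in the last line of the rule.

```python
import math


def adam_update(theta, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """Apply one Adam step to a scalar parameter; t is the 1-based step count.

    Hypothetical helper for illustration only, not a Lucid function.
    """
    beta1, beta2 = betas
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Both bias corrections folded into the step size.
    step_size = lr * math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)
    theta = theta - step_size * m / (math.sqrt(v) + eps)
    return theta, m, v


theta, m, v = adam_update(theta=1.0, grad=0.5, m=0.0, v=0.0, t=1)
```

On the first step the bias corrections cancel the moment scaling, so the parameter moves by roughly `lr` in the direction opposite the gradient: here `theta` goes from 1.0 to approximately 0.999.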

Moment buffers are allocated lazily, on the first step in which a parameter has a gradient, so parameters that never receive a gradient consume no optimizer state.

Parameters

params : iterable of Parameter or iterable of dict
Parameters to optimise, or a list of parameter-group dicts.
lr : float, default 0.001
Learning rate $\alpha$ (default: 1e-3).
betas : tuple of float, default (0.9, 0.999)
Coefficients $(\beta_1, \beta_2)$ for the first- and second-moment estimates (default: (0.9, 0.999)).
eps : float, default 1e-08
Term $\epsilon$ for numerical stability (default: 1e-8).

Attributes

param_groups : list of dict
Parameter groups with keys "params", "lr", "betas", and "eps".
defaults : dict
Default hyperparameter values.
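As an illustration of how `defaults` and `param_groups` could relate, the sketch below fills missing hyperparameters in each group from a defaults dict. The helper `build_param_groups` is hypothetical, and the merging behaviour is an assumption modelled on common optimizer designs; the source does not spell out how Lucid performs this step. Strings stand in for real parameters.

```python
def build_param_groups(params, defaults):
    """Normalise `params` into a list of group dicts with defaults filled in.

    Hypothetical helper; the real base-class logic may differ.
    """
    params = list(params)
    if params and not isinstance(params[0], dict):
        # A flat iterable of parameters becomes a single group.
        params = [{"params": params}]
    groups = []
    for group in params:
        merged = dict(defaults)  # start from the default hyperparameters
        merged.update(group)     # per-group values override them
        merged["params"] = list(group["params"])
        groups.append(merged)
    return groups


defaults = {"lr": 1e-3, "betas": (0.9, 0.999), "eps": 1e-8}
groups = build_param_groups(
    [
        {"params": ["embedding.weight"], "lr": 1e-2},  # overrides lr only
        {"params": ["head.weight"]},                   # inherits every default
    ],
    defaults,
)
```

Under these assumptions the first group keeps its own learning rate of 1e-2 while the second inherits 1e-3, and both end up with the four keys listed above.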

Notes

SparseAdam skips parameter updates when the gradient is None, which is the common case for embedding rows that were not accessed in the current mini-batch. This makes it efficient even though the underlying storage is dense.

Unlike the engine-backed optimizers (Adam, AdamW, SGD), SparseAdam manages its own Python-side moment buffers and does not use a C++ engine optimizer object.
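The two notes above (skipping `None` gradients and keeping moment buffers on the Python side) can be sketched with a plain dict of per-parameter state. Everything here is a simplified assumption: `ToyParam` and `sparse_adam_pass` are hypothetical names, values are scalars instead of engine tensors, and the step counter is kept per parameter, which the source does not confirm.

```python
import math


class ToyParam:
    """Hypothetical stand-in for a parameter: a scalar value plus optional grad."""

    def __init__(self, value, grad=None):
        self.value = value
        self.grad = grad


def sparse_adam_pass(params, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """One pass over params; `state` maps id(param) to its moment buffers."""
    beta1, beta2 = betas
    for p in params:
        if p.grad is None:
            continue  # no gradient: no update and no state allocated
        if id(p) not in state:
            # Buffers are created only when a gradient is first seen.
            state[id(p)] = {"step": 0, "m": 0.0, "v": 0.0}
        s = state[id(p)]
        s["step"] += 1
        s["m"] = beta1 * s["m"] + (1.0 - beta1) * p.grad
        s["v"] = beta2 * s["v"] + (1.0 - beta2) * p.grad * p.grad
        t = s["step"]
        step_size = lr * math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)
        p.value -= step_size * s["m"] / (math.sqrt(s["v"]) + eps)


used = ToyParam(1.0, grad=0.5)
unused = ToyParam(2.0)  # grad stays None
state = {}
sparse_adam_pass([used, unused], state)
```

After the pass, `state` holds a single entry for `used`, and `unused` keeps both its value and an empty state footprint.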

Examples

>>> import lucid.optim as optim
>>> optimizer = optim.SparseAdam(
...     model.embedding.parameters(), lr=1e-3
... )
>>> optimizer.zero_grad()
>>> loss = model(inputs)  # forward pass producing a scalar loss
>>> loss.backward()
>>> optimizer.step()

Methods (2)

dunder

__init__

None
__init__(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float = 0.001, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08)

Initialise the SparseAdam optimizer. See the class docstring for parameter semantics.

fn

step

Tensor or None
step(closure: _OptimizerClosure = None)

Perform a single SparseAdam optimisation step.

Iterates over all parameters. For each parameter whose gradient is not None, lazily initialises first- and second-moment buffers (on first call), increments the step counter, computes bias-corrected moment estimates, and applies the Adam update in-place via engine ops.

Parameters with None gradients (e.g. embedding rows not accessed in the current batch) are skipped entirely.

Parameters

closure : callable, default None
A closure that re-evaluates the model and returns the loss. If provided, it is called before the parameter updates.

Returns

Tensor or None

The loss returned by closure, or None if no closure was provided.
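The closure contract described above (call the closure first, then update, then return its loss) can be shown with a toy class. `ToyOptimizer` is a hypothetical stand-in that records the order of events instead of touching real parameters; it is a sketch of the contract, not Lucid's implementation.

```python
class ToyOptimizer:
    """Hypothetical optimizer that only demonstrates the closure contract."""

    def __init__(self):
        self.events = []

    def step(self, closure=None):
        loss = None
        if closure is not None:
            loss = closure()          # re-evaluate the model first
        self.events.append("update")  # parameter updates would happen here
        return loss


opt = ToyOptimizer()


def closure():
    opt.events.append("closure")
    return 0.25  # stand-in for a loss value


result = opt.step(closure)
```

Here `result` is the value returned by the closure and `opt.events` records the closure call ahead of the update; calling `step()` with no closure returns `None`.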

Examples

>>> optimizer.zero_grad()
>>> loss = model(inputs)
>>> loss.backward()
>>> optimizer.step()