class SparseAdam extends Optimizer

SparseAdam(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float = 0.001, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08)

Adam optimizer designed for sparse gradient workloads.

Implements a dense Adam update that is API-compatible with the sparse variant used in embedding-heavy models. All moment state is stored as dense lucid._C.engine.TensorImpl buffers and updated using engine operations, so the optimizer works correctly on both CPU and GPU.

The update rule is identical to standard Adam:

\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t} \\
\hat{v}_t &= \frac{v_t}{1 - \beta_2^t} \\
\theta_t &= \theta_{t-1} - \frac{\alpha \sqrt{1 - \beta_2^t}}{1 - \beta_1^t} \cdot \frac{m_t}{\sqrt{v_t} + \epsilon}
\end{aligned}
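The last line of the rule is written in terms of the raw moments rather than the bias-corrected estimates. The short derivation below is an editorial sketch, not taken from the library source, showing how the two forms relate. It only substitutes the definitions of \hat{m}_t and \hat{v}_t given above.

```latex
\begin{aligned}
\frac{\alpha \sqrt{1 - \beta_2^t}}{1 - \beta_1^t} \cdot \frac{m_t}{\sqrt{v_t} + \epsilon}
&= \alpha \sqrt{1 - \beta_2^t} \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t}\,\sqrt{1 - \beta_2^t} + \epsilon} \\
&= \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon / \sqrt{1 - \beta_2^t}}
\end{aligned}
```

Under this reading, the step matches the textbook form \alpha \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) except that \epsilon is divided by \sqrt{1 - \beta_2^t}, a factor that approaches 1 as t grows.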

Moment buffers are allocated lazily, on the first step in which a parameter has a gradient, so parameters that never receive a gradient consume no optimizer state.

Parameters

params : iterable of Parameter or iterable of dict

Parameters to optimise, or a list of parameter-group dicts.

lr : float = 0.001

Learning rate \alpha (default: 1e-3).

betas : tuple of float = (0.9, 0.999)

Coefficients (\beta_1, \beta_2) for the first- and second-moment estimates (default: (0.9, 0.999)).

eps : float = 1e-08

Term \epsilon added to the denominator for numerical stability (default: 1e-8).

Attributes

param_groups : list of dict

Parameter groups with keys "params", "lr", "betas", and "eps".

defaults : dict

Default hyperparameter values.
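To make the relationship between defaults and param_groups concrete, here is a minimal pure-Python sketch. The helper make_param_groups is hypothetical and is not part of lucid. It also assumes the common convention that keys set on a group override the defaults, which this page does not state explicitly. Strings stand in for Parameter objects.

```python
defaults = {"lr": 1e-3, "betas": (0.9, 0.999), "eps": 1e-8}

def make_param_groups(params, defaults):
    """Hypothetical helper: normalise `params` into a list of group dicts."""
    params = list(params)
    if params and not isinstance(params[0], dict):
        params = [{"params": params}]    # a flat iterable becomes one group
    groups = []
    for group in params:
        merged = dict(defaults)          # start from the default hyperparameters
        merged.update(group)             # keys set on the group override them
        merged["params"] = list(group["params"])
        groups.append(merged)
    return groups

# Two groups: the first overrides lr, the second falls back to the defaults.
groups = make_param_groups(
    [{"params": ["w_embed"], "lr": 1e-2}, {"params": ["w_out"]}],
    defaults,
)
flat = make_param_groups(["w_a", "w_b"], defaults)
```

Each resulting dict carries the four keys listed above, so per-group hyperparameters can be read uniformly during a step.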

Notes

SparseAdam skips a parameter entirely when its gradient is None, which is the common case for an embedding that took no part in the current mini-batch. This avoids wasted work on unused parameters even though the underlying storage is dense.

Unlike the engine-backed optimizers (Adam, AdamW, SGD), SparseAdam manages its own Python-side moment buffers and does not use a C++ engine optimizer object.

Examples

>>> import lucid.optim as optim
>>> optimizer = optim.SparseAdam(
...     model.embedding.parameters(), lr=1e-3
... )
>>> optimizer.zero_grad()
>>> loss.backward()
>>> optimizer.step()

Methods (2)

__init__ → None

__init__(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float = 0.001, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08)

Initialise the SparseAdam optimizer. See the class docstring for parameter semantics.

step → Tensor or None

step(closure: _OptimizerClosure = None)

Perform a single SparseAdam optimisation step.

Iterates over all parameters. For each parameter whose gradient is not None, lazily initialises first- and second-moment buffers (on first call), increments the step counter, computes bias-corrected moment estimates, and applies the Adam update in-place via engine ops.

Parameters whose gradient is None (e.g. an embedding that took no part in the current batch) are skipped entirely: no state is allocated and no update is applied.
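The loop described above can be sketched in plain Python. This is an illustration under stated assumptions, not lucid's implementation: ToyParam, toy_step, and the state keys "step", "m", and "v" are hypothetical names, and scalars stand in for the dense engine buffers.

```python
import math

class ToyParam:
    """Hypothetical stand-in for a Parameter: a scalar value and an optional grad."""
    def __init__(self, value, grad=None):
        self.value = value
        self.grad = grad

def toy_step(params, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    beta1, beta2 = betas
    for p in params:
        if p.grad is None:
            continue                       # skipped: no state, no update
        # Lazily create the moment buffers the first time a gradient appears.
        st = state.setdefault(id(p), {"step": 0, "m": 0.0, "v": 0.0})
        st["step"] += 1
        t = st["step"]
        st["m"] = beta1 * st["m"] + (1 - beta1) * p.grad
        st["v"] = beta2 * st["v"] + (1 - beta2) * p.grad ** 2
        # Bias corrections folded into the step size, as in the update rule.
        step_size = lr * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
        p.value -= step_size * st["m"] / (math.sqrt(st["v"]) + eps)

used, unused = ToyParam(1.0, grad=0.5), ToyParam(2.0)
state = {}
toy_step([used, unused], state)
```

After this call only `used` has an entry in `state`, and `unused` keeps its original value. On the first step the change in `used.value` is close to `lr` in magnitude, whatever the scale of the gradient.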

Parameters

closure : callable = None

A closure that re-evaluates the model and returns the loss. If provided, it is called before the parameter updates.

Returns

Tensor or None

The loss returned by closure, or None if no closure was provided.
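The closure contract can be illustrated without any framework code. In the sketch below, step_with_closure and apply_updates are hypothetical names used only to show the ordering and the return value described above. Nothing here is lucid API.

```python
def step_with_closure(apply_updates, closure=None):
    """Hypothetical sketch: call the closure first, then update, then return the loss."""
    loss = None
    if closure is not None:
        loss = closure()       # re-evaluate the model before any update
    apply_updates()            # parameter updates happen afterwards
    return loss

calls = []

def closure():
    calls.append("closure")
    return 0.25                # stand-in for a loss value

def apply_updates():
    calls.append("update")

loss = step_with_closure(apply_updates, closure)
no_loss = step_with_closure(apply_updates)
```

With a closure, the recorded order is closure then update and the loss is passed through. Without one, the function returns None.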

Examples

>>> optimizer.zero_grad()
>>> loss = model(inputs)
>>> loss.backward()
>>> optimizer.step()