SparseAdam
Optimizer
SparseAdam(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float = 0.001, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08)
Adam optimizer designed for sparse gradient workloads.
Implements a dense Adam update that is API-compatible with the sparse
variant used in embedding-heavy models. All moment state is stored as
dense lucid._C.engine.TensorImpl buffers and updated using
engine operations, so the optimizer works correctly on both CPU and GPU.
The update rule is identical to standard Adam:
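For reference, standard Adam is usually written as below. The symbols are assumptions introduced here, not taken from the library: g_t is the gradient at step t, \theta the parameter, \eta corresponds to lr, \epsilon to eps, and (\beta_1, \beta_2) to betas. The exact placement of \epsilon inside the engine ops is not stated on this page.

```latex
\begin{aligned}
m_t &= \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2\, v_{t-1} + (1 - \beta_2)\, g_t^{2} \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^{t}}, \qquad
\hat{v}_t = \frac{v_t}{1 - \beta_2^{t}} \\
\theta_t &= \theta_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
```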
Moment buffers are allocated lazily, on the first step in which a parameter has a gradient, so parameters that never receive a gradient use no extra memory.
Parameters
params : iterable of Parameter or iterable of dict
lr : float = 0.001
betas : tuple of float = (0.9, 0.999)
eps : float = 1e-08
Attributes
param_groups : list of dict
    Each dict has the keys "params", "lr", "betas", and "eps".
defaults : dict
Notes
SparseAdam skips parameter updates when the gradient is None,
which is the common case for embedding rows that were not accessed in
the current mini-batch. This makes it efficient even though the
underlying storage is dense.
Unlike the engine-backed optimizers (Adam, AdamW, SGD), SparseAdam
manages its own Python-side moment buffers and does not use a C++
engine optimizer object.
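The behaviour described in these notes (skip on a None gradient, allocate moment buffers lazily, keep state on the Python side) can be sketched in plain Python. This is an illustrative stand-in under stated assumptions, not lucid's implementation: the function name sparse_adam_step, the list-of-floats parameters, and the state layout are all hypothetical.

```python
import math

def sparse_adam_step(params, grads, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """Hypothetical sketch of one Adam step that skips None gradients.

    params: list of parameters, each a mutable list of floats.
    grads:  list of gradients aligned with params; an entry may be None.
    state:  dict mapping parameter index -> moment buffers (filled lazily).
    """
    beta1, beta2 = betas
    for idx, (p, g) in enumerate(zip(params, grads)):
        if g is None:
            # No gradient this step: no update and no buffer allocation.
            continue
        if idx not in state:
            # Lazy initialisation on the first step that sees a gradient.
            state[idx] = {"step": 0, "m": [0.0] * len(p), "v": [0.0] * len(p)}
        st = state[idx]
        st["step"] += 1
        t = st["step"]
        for j, gj in enumerate(g):
            # Exponential moving averages of the gradient and its square.
            st["m"][j] = beta1 * st["m"][j] + (1 - beta1) * gj
            st["v"][j] = beta2 * st["v"][j] + (1 - beta2) * gj * gj
            # Bias-corrected estimates.
            m_hat = st["m"][j] / (1 - beta1 ** t)
            v_hat = st["v"][j] / (1 - beta2 ** t)
            # In-place parameter update.
            p[j] -= lr * m_hat / (math.sqrt(v_hat) + eps)

# One parameter receives a gradient, the other does not.
embedding = [1.0, 1.0, 1.0]
unused = [5.0, 5.0]
state = {}
sparse_adam_step([embedding, unused], [[1.0, -2.0, 0.5], None], state)
```

After this call only `embedding` has changed and only index 0 has an entry in `state`; `unused` keeps its values and has no moment buffers.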
Examples
>>> import lucid.optim as optim
>>> optimizer = optim.SparseAdam(
... model.embedding.parameters(), lr=1e-3
... )
>>> optimizer.zero_grad()
>>> loss.backward()
>>> optimizer.step()
Methods (2)
__init__
__init__(params: Iterable[Parameter] | Iterable[dict[str, object]], lr: float = 0.001, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08) → None
Initialise the SparseAdam optimizer. See the class docstring for parameter semantics.
step
step(closure: _OptimizerClosure = None) → Tensor or None
Perform a single SparseAdam optimisation step.
Iterates over all parameters. For each parameter whose gradient
is not None, lazily initialises first- and second-moment
buffers (on first call), increments the step counter, computes
bias-corrected moment estimates, and applies the Adam update
in-place via engine ops.
Parameters with None gradients (e.g. embedding rows not
accessed in the current batch) are skipped entirely.
Parameters
closure : callable = None
Returns
Tensor or None
    The loss returned by closure, or None if no closure was provided.
Examples
>>> optimizer.zero_grad()
>>> loss = model(inputs)
>>> loss.backward()
>>> optimizer.step()
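The return contract of step (the closure's loss, or None without a closure) can be illustrated with a small stand-in. This is a sketch of the general pattern only: the function below and its apply_update argument are hypothetical, and evaluating the closure before the update is an assumption borrowed from common optimizer designs rather than something this page states.

```python
from typing import Callable, Optional

def step(apply_update: Callable[[], None],
         closure: Optional[Callable[[], float]] = None) -> Optional[float]:
    """Hypothetical stand-in for an optimizer's step(closure) contract."""
    loss = None
    if closure is not None:
        # Assumed order: the closure re-evaluates the model and returns the loss.
        loss = closure()
    # The parameter update itself is abstracted away as a callback.
    apply_update()
    return loss

calls = []
result_without = step(lambda: calls.append("update"))
result_with = step(lambda: calls.append("update"), closure=lambda: 0.25)
```

Here `result_without` is None and `result_with` is the value produced by the closure; the update callback runs in both cases.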