optim.Adam

class lucid.optim.Adam(params: Iterable[Parameter], lr: float = 0.001, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.0, amsgrad: bool = False)

The Adam class implements the Adam optimization algorithm. Adam (Adaptive Moment Estimation) maintains an exponentially decaying average of past gradients (first moment) and squared gradients (second moment) to adaptively adjust the learning rate of each parameter. This makes Adam suitable for large-scale machine learning tasks and models with sparse gradients.

Class Signature

class Adam(optim.Optimizer):
    def __init__(
        self,
        params: Iterable[nn.Parameter],
        lr: float = 1e-3,
        betas: tuple[float, float] = (0.9, 0.999),
        eps: float = 1e-8,
        weight_decay: float = 0.0,
        amsgrad: bool = False,
    ) -> None

Parameters

  • Learning Rate (`lr`): Controls the step size during parameter updates. A higher learning rate can speed up convergence but may lead to instability.

  • Betas (`betas`): A tuple of two coefficients controlling the exponential decay of the moment estimates. The first value (beta1) sets the decay rate of the first moment (the mean of the gradients), while the second value (beta2) sets the decay rate of the second moment (the uncentered variance of the gradients).

  • Epsilon (`eps`): A small constant added to the denominator to prevent division by zero. This improves numerical stability during training.

  • Weight Decay (`weight_decay`): Adds an L2 penalty (weight decay) to the parameter updates, encouraging smaller weights. This can help prevent overfitting.

  • AMSGrad (`amsgrad`): If True, uses the AMSGrad variant of Adam, which ensures that the denominator does not decrease over time. This helps address some issues with non-convergence in certain settings.
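
As a quick illustration, these hyperparameters map directly onto the constructor shown above. The sketch below assumes `model` is any `nn.Module` whose `parameters()` yields the parameters to optimize (for instance the `MyModel` defined in the Examples section); the specific values are arbitrary.

import lucid.optim as optim

# Plain Adam with a mild L2 penalty on the weights
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# AMSGrad variant with a faster-decaying second moment (smaller beta2)
amsgrad_optimizer = optim.Adam(
    model.parameters(), betas=(0.9, 0.99), eps=1e-8, amsgrad=True
)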

Algorithm

The Adam optimization algorithm updates each parameter according to the following formulas:

\[
\begin{aligned}
m_{t} &= \beta_1 m_{t-1} + (1 - \beta_1) \nabla_{t} \\
v_{t} &= \beta_2 v_{t-1} + (1 - \beta_2) \nabla_{t}^2 \\
\hat{m}_{t} &= \frac{m_{t}}{1 - \beta_1^t} \\
\hat{v}_{t} &= \frac{v_{t}}{1 - \beta_2^t} \\
\theta_{t+1} &= \theta_{t} - \frac{\text{lr} \cdot \hat{m}_{t}}{\sqrt{\hat{v}_{t}} + \epsilon}
\end{aligned}
\]

Where:

  • \(\nabla_{t}\) is the gradient of the loss with respect to the parameter at iteration \(t\).

  • \(m_{t}\) and \(v_{t}\) are the exponentially decaying first and second moment estimates.

  • \(\hat{m}_{t}\) and \(\hat{v}_{t}\) are bias-corrected estimates of the first and second moments.

  • \(\theta_{t}\) is the parameter at iteration \(t\).

  • \(\text{lr}\) is the learning rate.

If amsgrad is True, the denominator uses the maximum of all previously observed bias-corrected second-moment estimates (\(\max_{\tau \le t} \hat{v}_{\tau}\)) in place of the current \(\hat{v}_{t}\).
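
To make the update concrete, here is a minimal single-parameter sketch of one Adam step in plain NumPy. It is written from the formulas above and is independent of lucid's internal implementation; the optional weight-decay and AMSGrad branches follow the parameter descriptions earlier in this page.

import numpy as np

def adam_step(theta, grad, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
              weight_decay=0.0, amsgrad=False):
    """One Adam update for a single parameter array `theta`.

    `state` holds the running moments m, v, v_max and the step count t.
    """
    beta1, beta2 = betas
    if weight_decay != 0.0:
        grad = grad + weight_decay * theta  # L2 penalty added to the gradient

    state["t"] += 1
    t = state["t"]

    # Exponentially decaying first and second moment estimates
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2

    # Bias correction
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)

    if amsgrad:
        # Use the running maximum of v_hat so the denominator never shrinks
        state["v_max"] = np.maximum(state["v_max"], v_hat)
        v_hat = state["v_max"]

    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)

# Usage on a toy quadratic loss 0.5 * theta^2, whose gradient is theta
theta = np.array([1.0, -2.0])
state = {"m": np.zeros_like(theta), "v": np.zeros_like(theta),
         "v_max": np.zeros_like(theta), "t": 0}
for _ in range(3):
    theta = adam_step(theta, grad=theta, state=state)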

Examples

Using the Adam Optimizer

import lucid.optim as optim
import lucid.nn as nn

# Define a simple model
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.param = nn.Parameter([1.0, 2.0, 3.0])

    def forward(self, x):
        return x * self.param

# Initialize model and Adam optimizer
model = MyModel()
optimizer = optim.Adam(
    model.parameters(),
    lr=0.001,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=1e-4,
    amsgrad=True,
)

# Training loop (`data` and `compute_loss` are placeholders for your
# dataset and loss function)
for input, target in data:
    optimizer.zero_grad()
    output = model(input)
    loss = compute_loss(output, target)
    loss.backward()
    optimizer.step()

Inspecting Optimizer State

Use the state_dict() and load_state_dict() methods to save and load the optimizer state.

# Save state
optimizer_state = optimizer.state_dict()

# Load state
optimizer.load_state_dict(optimizer_state)
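
If the state needs to be persisted between runs, and assuming the returned state dict is a plain picklable Python dictionary, the standard library is enough; the file name below is arbitrary.

import pickle

# Save the optimizer state to disk
with open("adam_state.pkl", "wb") as f:
    pickle.dump(optimizer.state_dict(), f)

# Later, restore it into a freshly constructed optimizer
with open("adam_state.pkl", "rb") as f:
    optimizer.load_state_dict(pickle.load(f))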