class LSTM (extends Module)

LSTM(input_size: int, hidden_size: int, num_layers: int = 1, bias: bool = True, batch_first: bool = False, dropout: float = 0.0, bidirectional: bool = False, proj_size: int = 0, device: DeviceLike = None, dtype: DTypeLike = None)

Long Short-Term Memory (LSTM) recurrent layer.

Applies a multi-layer LSTM over an input sequence. At each time step $t$ the following gated update equations are evaluated:

$$
\begin{aligned}
i_t &= \sigma(W_{ii}\,x_t + b_{ii} + W_{hi}\,h_{t-1} + b_{hi}) \\
f_t &= \sigma(W_{if}\,x_t + b_{if} + W_{hf}\,h_{t-1} + b_{hf}) \\
g_t &= \tanh(W_{ig}\,x_t + b_{ig} + W_{hg}\,h_{t-1} + b_{hg}) \\
o_t &= \sigma(W_{io}\,x_t + b_{io} + W_{ho}\,h_{t-1} + b_{ho}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

where $\sigma$ is the sigmoid function, $\odot$ is the element-wise (Hadamard) product, $x_t$ is the input at step $t$, and $h_{t-1}$, $c_{t-1}$ are the hidden and cell states carried from the previous step.

The four gates have the following roles:

  • Input gate $i_t$: controls how much new information enters the cell state.
  • Forget gate $f_t$: controls how much of the previous cell state is retained.
  • Cell gate $g_t$: candidate new content to add to the cell state.
  • Output gate $o_t$: controls what portion of the cell state is exposed as the hidden state $h_t$.
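
The update equations above can be traced with a minimal NumPy sketch of one time step. It assumes the stacked gate order [i; f; g; o] described for weight_ih_l0 and weight_hh_l0; the helper names `sigmoid` and `lstm_step` are illustrative and are not part of Lucid's API.

```python
import numpy as np


def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, w_ih, w_hh, b_ih, b_hh):
    """One LSTM time step (illustrative helper, not a Lucid function).

    x_t: (N, I); h_prev, c_prev: (N, H)
    w_ih: (4H, I); w_hh: (4H, H); b_ih, b_hh: (4H,)
    Gate rows are assumed to be stacked in the order [i; f; g; o].
    """
    gates = x_t @ w_ih.T + b_ih + h_prev @ w_hh.T + b_hh  # (N, 4H)
    i, f, g, o = np.split(gates, 4, axis=-1)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c_t = f * c_prev + i * g   # keep part of the old cell, add gated candidate
    h_t = o * np.tanh(c_t)     # expose a gated view of the cell state
    return h_t, c_t


# Tiny example: N=2, I=3, H=4 with random weights and zero initial state.
rng = np.random.default_rng(0)
N, I, H = 2, 3, 4
h_t, c_t = lstm_step(
    rng.standard_normal((N, I)),
    np.zeros((N, H)), np.zeros((N, H)),
    rng.standard_normal((4 * H, I)), rng.standard_normal((4 * H, H)),
    np.zeros(4 * H), np.zeros(4 * H),
)
print(h_t.shape, c_t.shape)
```

With all weights and biases set to zero, the three sigmoid gates evaluate to 0.5 and the candidate to 0, so the step simply halves the previous cell state. This is a convenient sanity check for any reimplementation.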

When num_layers > 1 the hidden state output of layer $\ell$ becomes the input to layer $\ell + 1$. Inter-layer dropout (probability dropout) is applied between every pair of adjacent layers during training.

When bidirectional=True, two independent LSTMs process the sequence in opposite directions. Their output hidden states are concatenated along the feature dimension at each time step, so the output feature size is 2 * hidden_size. The hidden/cell state tensors h_n and c_n pack both directions along their leading axis in the order [fwd_l0, rev_l0, fwd_l1, rev_l1, ...].
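
The packing order of h_n and c_n can be expressed as a small index calculation. The sketch below is plain Python; `state_index` is a hypothetical helper name used only to illustrate the [fwd_l0, rev_l0, fwd_l1, rev_l1, ...] layout described above.

```python
def state_index(layer, reverse=False, bidirectional=False):
    """Row of h_n / c_n holding the final state for (layer, direction)."""
    if reverse and not bidirectional:
        raise ValueError("reverse direction requires bidirectional=True")
    num_directions = 2 if bidirectional else 1
    return layer * num_directions + int(reverse)


# 2-layer bidirectional LSTM: rows are [fwd_l0, rev_l0, fwd_l1, rev_l1].
order = [
    ("rev" if rev else "fwd") + f"_l{layer}"
    for layer in range(2)
    for rev in (False, True)
]
print(order)
print(state_index(1, reverse=True, bidirectional=True))
```

Under this layout, indexing h_n with `state_index(num_layers - 1, reverse=True, bidirectional=True)` would select the last layer's backward-direction state.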

Parameters

input_size : int
Number of expected features in each input vector $x_t$.
hidden_size : int
Number of features in the hidden state $h_t$ (denoted $H$ below).
num_layers : int = 1
Number of stacked recurrent layers. Default: 1.
bias : bool = True
If False all bias terms are treated as zero (the parameter tensors still exist for API compatibility, but are set to zero and contribute nothing numerically). Default: True.
batch_first : bool = False
If True the expected input/output shape is (N, L, input_size) / (N, L, H * D) rather than the default (L, N, input_size) / (L, N, H * D). Default: False.
dropout : float = 0.0
Dropout probability applied to the outputs of every LSTM layer except the last. 0.0 disables dropout. Default: 0.0.
bidirectional : bool = False
If True use a bidirectional LSTM; doubles the output feature size and the leading dimension of h_n / c_n. Default: False.
proj_size : int = 0
If > 0, adds a learnable linear projection of size proj_size after each LSTM cell output, reducing the recurrent hidden dimension fed to the next time step. Default: 0 (disabled).
device : DeviceLike = None
Device on which to allocate weight tensors.
dtype : DTypeLike = None
Data type for weight tensors.

Attributes

weight_ih_l0 : Parameter, shape ``(4H, I)``
Input–hidden weight matrix for layer 0, forward direction. Stacks the four gate weight matrices $[W_{ii}; W_{if}; W_{ig}; W_{io}]$. Here $I$ is input_size and $H$ is hidden_size.
weight_hh_l0 : Parameter, shape ``(4H, H)``
Hidden–hidden weight matrix for layer 0, forward direction. Stacks $[W_{hi}; W_{hf}; W_{hg}; W_{ho}]$. When proj_size > 0 the second dimension shrinks to proj_size (the recurrent state is the projected output).
bias_ih_l0 : Parameter, shape ``(4H,)``
Input–hidden bias for layer 0, forward direction. Present only when bias=True.
bias_hh_l0 : Parameter, shape ``(4H,)``
Hidden–hidden bias for layer 0, forward direction. Present only when bias=True.

For layer ``k`` in the forward direction substitute ``l0`` → ``l{k}``; for the backward direction append ``_reverse``, e.g. ``weight_ih_l1_reverse``.
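
The naming rule can be made concrete with a short sketch that enumerates the attribute names for a given configuration. It assumes bias=True and the default proj_size=0; `lstm_param_names` and `layer0_shapes` are hypothetical helpers, and only the layer-0 shapes listed above are reproduced, since the shapes for deeper layers are not stated here.

```python
def lstm_param_names(num_layers=1, bidirectional=False):
    """Attribute names following the l{k} / _reverse rule (bias=True assumed)."""
    kinds = ["weight_ih", "weight_hh", "bias_ih", "bias_hh"]
    suffixes = ["", "_reverse"] if bidirectional else [""]
    return [
        f"{kind}_l{layer}{suffix}"
        for layer in range(num_layers)
        for suffix in suffixes
        for kind in kinds
    ]


def layer0_shapes(input_size, hidden_size):
    """Documented layer-0, forward-direction shapes for proj_size=0."""
    four_h = 4 * hidden_size
    return {
        "weight_ih_l0": (four_h, input_size),
        "weight_hh_l0": (four_h, hidden_size),
        "bias_ih_l0": (four_h,),
        "bias_hh_l0": (four_h,),
    }


names = lstm_param_names(num_layers=2, bidirectional=True)
print(len(names))
print(names[:4])
print(layer0_shapes(10, 20)["weight_ih_l0"])
```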

Notes

  • Input x: (L, N, input_size) or (N, L, input_size) when batch_first=True. L = sequence length, N = batch size.
  • h_0, c_0 (optional): (D * num_layers, N, H) where D = 2 if bidirectional else 1. When omitted they default to zero tensors.
  • output: (L, N, D * H) or (N, L, D * H) when batch_first=True.
  • h_n: (D * num_layers, N, H) — hidden state after the last time step for every layer and direction.
  • c_n: (D * num_layers, N, H) — cell state after the last time step for every layer and direction.
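
The shape rules above reduce to a few lines of arithmetic. The following sketch uses a hypothetical helper, `lstm_shapes`, that assumes proj_size=0 and reproduces the shapes shown in the Examples section.

```python
def lstm_shapes(seq_len, batch, hidden_size, num_layers=1,
                bidirectional=False, batch_first=False):
    """Return (output_shape, state_shape) per the shape notes (proj_size=0)."""
    d = 2 if bidirectional else 1
    if batch_first:
        output = (batch, seq_len, d * hidden_size)
    else:
        output = (seq_len, batch, d * hidden_size)
    # h_n and c_n share this shape; batch_first does not affect it.
    state = (d * num_layers, batch, hidden_size)
    return output, state


print(lstm_shapes(5, 2, 20, batch_first=True))
print(lstm_shapes(12, 4, 32, num_layers=2, bidirectional=True, batch_first=True))
```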

All weight matrices are initialised with a uniform distribution $\mathcal{U}(-1/\sqrt{H},\, 1/\sqrt{H})$, matching the initialisation convention of the reference framework.
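
As an illustration of that initialisation bound, the sketch below draws a layer-0 input–hidden matrix with NumPy. The helper name `init_uniform` and the use of NumPy's random generator are assumptions made for the example; Lucid's own initialisation code may differ.

```python
import math

import numpy as np


def init_uniform(shape, hidden_size, rng):
    """Sample from U(-1/sqrt(H), 1/sqrt(H)) (illustrative helper)."""
    bound = 1.0 / math.sqrt(hidden_size)
    return rng.uniform(-bound, bound, size=shape)


rng = np.random.default_rng(0)
H, I = 20, 10
w_ih = init_uniform((4 * H, I), H, rng)
print(w_ih.shape)
print(bool(np.abs(w_ih).max() <= 1.0 / math.sqrt(H)))
```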

The C++ engine processes a single layer and a single direction per call. Multi-layer and bidirectional configurations are composed entirely in Python: the module loops over layers and directions, applies inter-layer dropout, and concatenates bidirectional outputs.
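
The composition described above can be sketched as a pair of nested loops. This is not Lucid's actual implementation: `run_stack` is a hypothetical name, `toy_single` is a stand-in for the single-layer, single-direction engine call, and the inverted-dropout scaling is an assumption.

```python
import numpy as np


def run_stack(x, run_single, num_layers=1, bidirectional=False,
              dropout=0.0, training=False, rng=None):
    """Compose single-layer, single-direction calls into a full LSTM pass.

    x: (L, N, I). run_single(inp, layer, reverse) must return
    (out, h, c) with out: (L, N, H) and h, c: (N, H).
    """
    rng = rng if rng is not None else np.random.default_rng()
    directions = (False, True) if bidirectional else (False,)
    h_n, c_n = [], []
    layer_input = x
    for layer in range(num_layers):
        outs = []
        for reverse in directions:
            out, h, c = run_single(layer_input, layer, reverse)
            outs.append(out)
            h_n.append(h)   # order: fwd_l0, rev_l0, fwd_l1, rev_l1, ...
            c_n.append(c)
        layer_output = np.concatenate(outs, axis=-1)  # (L, N, D * H)
        layer_input = layer_output
        if training and dropout > 0.0 and layer < num_layers - 1:
            # Inverted dropout between layers (the scaling is an assumption).
            keep = rng.random(layer_output.shape) >= dropout
            layer_input = layer_output * keep / (1.0 - dropout)
    return layer_output, (np.stack(h_n), np.stack(c_n))


seen_feature_sizes = []


def toy_single(inp, layer, reverse, hidden_size=3):
    """Hypothetical engine stand-in: returns zeros of the right shapes."""
    seen_feature_sizes.append(inp.shape[-1])
    seq_len, batch, _ = inp.shape
    out = np.zeros((seq_len, batch, hidden_size))
    return out, out[-1], out[-1]


output, (h_n, c_n) = run_stack(np.zeros((5, 2, 4)), toy_single,
                               num_layers=2, bidirectional=True)
print(output.shape, h_n.shape, seen_feature_sizes)
```

The recorded feature sizes show that the second layer receives the concatenated output of both directions (2 * 3 = 6 features) rather than the original input width.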

flatten_parameters is a no-op provided purely for API compatibility with codebases that call it before the forward pass.

PackedSequence input is not yet supported. Use lucid.nn.utils.rnn.pad_packed_sequence to unpack manually before passing to this module.

See Also

LSTMCell : Single time-step LSTM cell.
GRU : Gated Recurrent Unit (simpler, no cell state).
RNN : Vanilla Elman RNN.

Examples

Basic sequence encoding (batch-first convention):
>>> import lucid, lucid.nn as nn
>>> lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
>>> x = lucid.randn(2, 5, 10)          # (N=2, L=5, I=10)
>>> output, (h_n, c_n) = lstm(x)
>>> output.shape, h_n.shape, c_n.shape
((2, 5, 20), (1, 2, 20), (1, 2, 20))

Bidirectional, 2-layer LSTM with dropout:
>>> lstm2 = nn.LSTM(
...     input_size=16, hidden_size=32,
...     num_layers=2, dropout=0.3,
...     bidirectional=True, batch_first=True,
... )
>>> x2 = lucid.randn(4, 12, 16)        # (N=4, L=12, I=16)
>>> out2, (h_n2, c_n2) = lstm2(x2)
>>> out2.shape    # D*H = 2*32 = 64
(4, 12, 64)
>>> h_n2.shape    # D*num_layers = 4
(4, 4, 32)

Methods (4)

__init__(input_size: int, hidden_size: int, num_layers: int = 1, bias: bool = True, batch_first: bool = False, dropout: float = 0.0, bidirectional: bool = False, proj_size: int = 0, device: DeviceLike = None, dtype: DTypeLike = None) -> None

Initialise the LSTM module. See the class docstring for parameter semantics.

flatten_parameters() -> None

No-op for API compatibility with reference recurrent modules.

Some external codepaths call flatten_parameters() to coalesce weights for cuDNN; Lucid's BLAS / MLX path has no such concern, so this is a placeholder that lets such code run unchanged.

forward(x: Tensor, hx: tuple[Tensor, Tensor] | None = None) -> tuple[Tensor, tuple[Tensor, Tensor]]

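Run the forward pass over all layers and, when bidirectional=True, both directions. Returns (output, (h_n, c_n)) with the shapes listed in the Notes section.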

The C++ engine only handles a single layer in a single direction at a time, so this loops over layers and directions, applying inter-layer dropout and concatenating bidirectional outputs as the input to the next layer.

extra_repr() -> str

Return a string representation of the layer's configuration.