LSTM
Module LSTM(input_size: int, hidden_size: int, num_layers: int = 1, bias: bool = True, batch_first: bool = False, dropout: float = 0.0, bidirectional: bool = False, proj_size: int = 0, device: DeviceLike = None, dtype: DTypeLike = None)
Long Short-Term Memory (LSTM) recurrent layer.
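Applies a multi-layer LSTM over an input sequence. At each time step $t$ the following gated update equations are evaluated:

$$
\begin{aligned}
i_t &= \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi}) \\
f_t &= \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf}) \\
g_t &= \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg}) \\
o_t &= \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

where $\sigma$ is the sigmoid function, $\odot$ is the element-wise (Hadamard) product, $x_t$ is the input at step $t$, and $h_{t-1}$, $c_{t-1}$ are the hidden and cell states carried from the previous step. The matrices $W_{i\ast}$, $W_{h\ast}$ and vectors $b_{i\ast}$, $b_{h\ast}$ denote the per-gate blocks of the input-to-hidden and hidden-to-hidden weights and biases.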
The four gates have the following roles:
- Input gate $i_t$: controls how much new information enters the cell state.
- Forget gate $f_t$: controls how much of the previous cell state is retained.
- Cell gate $g_t$: candidate new content to add to the cell state.
- Output gate $o_t$: controls what portion of the cell state is exposed as the hidden state $h_t$.
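The gated update and the four gate roles above can be illustrated with a minimal pure-Python sketch of one time step. It is not Lucid's implementation: the helper names (sigmoid, matvec, lstm_step) are hypothetical, and the i, f, g, o ordering of the four gate blocks inside the stacked 4H rows is an assumption that this page does not confirm.

```python
import math


def sigmoid(z):
    """Logistic sigmoid for a single float."""
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def lstm_step(x, h_prev, c_prev, W_ih, W_hh, b_ih, b_hh):
    """One LSTM time step on plain Python lists (illustrative only).

    W_ih has 4H rows of length I, W_hh has 4H rows of length H, and both
    biases have length 4H. ASSUMPTION: gate blocks are stacked as i, f, g, o.
    """
    H = len(h_prev)
    # Pre-activations for all four gates, stacked into one vector of length 4H.
    z = [
        a + b + p + q
        for a, b, p, q in zip(matvec(W_ih, x), b_ih, matvec(W_hh, h_prev), b_hh)
    ]
    i = [sigmoid(v) for v in z[0:H]]            # input gate
    f = [sigmoid(v) for v in z[H:2 * H]]        # forget gate
    g = [math.tanh(v) for v in z[2 * H:3 * H]]  # cell gate (candidate content)
    o = [sigmoid(v) for v in z[3 * H:4 * H]]    # output gate
    # New cell state: retained old content plus gated new content.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # New hidden state: gated view of the squashed cell state.
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c
```

With all weights and biases set to zero, every sigmoid gate evaluates to 0.5 and the cell gate to 0, so the step simply halves the previous cell state; this makes a convenient sanity check.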
When num_layers > 1 the hidden state output of layer $l$
becomes the input to layer $l + 1$. Inter-layer dropout
(probability dropout) is applied between every pair of adjacent
layers during training.
When bidirectional=True, two independent LSTMs process the
sequence in opposite directions. Their output hidden states are
concatenated along the feature dimension at each time step, so the
output feature size is 2 * hidden_size. The hidden/cell state
tensors h_n and c_n pack both directions along their leading
axis in the order [fwd_l0, rev_l0, fwd_l1, rev_l1, ...].
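The packing order of h_n and c_n can be made concrete with a small sketch. The helpers state_index and state_labels are hypothetical names introduced here for illustration and are not part of Lucid; they only encode the [fwd_l0, rev_l0, fwd_l1, rev_l1, ...] order stated above.

```python
def state_index(layer, reverse=False, bidirectional=True):
    """Index along the leading axis of h_n / c_n for a layer and direction."""
    if reverse and not bidirectional:
        raise ValueError("a unidirectional LSTM has no reverse direction")
    D = 2 if bidirectional else 1
    return layer * D + (1 if reverse else 0)


def state_labels(num_layers, bidirectional=True):
    """Labels for every slot of the leading axis, in packing order."""
    labels = []
    for layer in range(num_layers):
        labels.append(f"fwd_l{layer}")
        if bidirectional:
            labels.append(f"rev_l{layer}")
    return labels
```

For a 2-layer bidirectional LSTM, the final hidden state of the top layer's reverse direction would then sit at index state_index(1, reverse=True) of h_n.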
Parameters
- input_size (int): feature size of the input x.
- hidden_size (int): size H of the hidden and cell states.
- num_layers (int = 1): number of stacked LSTM layers. Default: 1.
- bias (bool = True): if False, all bias terms are treated as zero (the parameter tensors still exist for API compatibility, but are set to zero and contribute nothing numerically). Default: True.
- batch_first (bool = False): if True, the expected input/output shape is (N, L, input_size) / (N, L, H * D) rather than the default (L, N, input_size) / (L, N, H * D). Default: False.
- dropout (float = 0.0): inter-layer dropout probability; 0.0 disables dropout. Default: 0.0.
- bidirectional (bool = False): if True, use a bidirectional LSTM; doubles the output feature size and the leading dimension of h_n / c_n. Default: False.
- proj_size (int = 0): if > 0, adds a learnable linear projection of size proj_size after each LSTM cell output, reducing the recurrent hidden dimension fed to the next time step. Default: 0 (disabled).
- device (DeviceLike = None)
- dtype (DTypeLike = None)
Attributes
- weight_ih_l0 (Parameter, shape (4H, I)): input-to-hidden weights of layer 0, where I is input_size and H is hidden_size.
- weight_hh_l0 (Parameter, shape (4H, H)): hidden-to-hidden weights of layer 0. When proj_size > 0 the second dimension shrinks to proj_size (the recurrent state is the projected output).
- bias_ih_l0 (Parameter, shape (4H,)): input-to-hidden bias of layer 0; contributes only when bias=True.
- bias_hh_l0 (Parameter, shape (4H,)): hidden-to-hidden bias of layer 0; contributes only when bias=True.
For layer k in the forward direction substitute l0 → l{k}; for the backward direction append _reverse, e.g. weight_ih_l1_reverse.
Notes
- Input x: (L, N, input_size) or (N, L, input_size) when batch_first=True. L = sequence length, N = batch size.
- h_0, c_0 (optional): (D * num_layers, N, H) where D = 2 if bidirectional else 1. When omitted they default to zero tensors.
- output: (L, N, D * H) or (N, L, D * H) when batch_first=True.
- h_n: (D * num_layers, N, H), the hidden state after the last time step for every layer and direction.
- c_n: (D * num_layers, N, H), the cell state after the last time step for every layer and direction.
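The shape rules listed above can be checked with a short helper. This is a sketch rather than library code: lstm_shapes is a hypothetical name, and it assumes proj_size = 0 so that the hidden and cell states share the size H.

```python
def lstm_shapes(N, L, input_size, hidden_size, num_layers=1,
                bidirectional=False, batch_first=False):
    """Expected tensor shapes for an LSTM call (assumes proj_size == 0)."""
    D = 2 if bidirectional else 1
    H = hidden_size
    if batch_first:
        x = (N, L, input_size)
        output = (N, L, D * H)
    else:
        x = (L, N, input_size)
        output = (L, N, D * H)
    # h_n and c_n are never batch-first: their leading axis packs
    # layers and directions.
    state = (D * num_layers, N, H)
    return {"x": x, "output": output, "h_n": state, "c_n": state}
```

Calling lstm_shapes(N=4, L=12, input_size=16, hidden_size=32, num_layers=2, bidirectional=True, batch_first=True) reproduces the shapes of the bidirectional example in the Examples section.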
All weight matrices are initialised from a uniform distribution, matching the initialisation convention of the reference framework.
The C++ engine processes a single layer and a single direction per call. Multi-layer and bidirectional configurations are composed entirely in Python: the module loops over layers and directions, applies inter-layer dropout, and concatenates bidirectional outputs.
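The Python-side composition described above might look roughly like the following sketch, written on plain lists so that it runs without Lucid. Every name here is hypothetical: run_direction stands in for the single-layer, single-direction engine call, dropout is any callable applied to one time step, and toy_engine is a running-sum placeholder rather than a real LSTM.

```python
def run_stack(seq, num_layers, bidirectional, run_direction,
              dropout=None, training=False):
    """Compose layers and directions around a single-direction engine.

    seq is a list of per-step feature lists. run_direction(layer, reverse,
    seq) receives the sequence in processing order and returns one output
    list per step.
    """
    layer_input = seq
    for layer in range(num_layers):
        fwd = run_direction(layer, False, layer_input)
        if bidirectional:
            # Process the reversed sequence, then flip the result back so
            # that position t of both directions refers to the same step.
            rev = run_direction(layer, True, layer_input[::-1])[::-1]
            # List concatenation plays the role of feature-axis concatenation.
            layer_output = [f + r for f, r in zip(fwd, rev)]
        else:
            layer_output = fwd
        is_last = layer == num_layers - 1
        # Dropout sits between adjacent layers only, and only in training.
        if training and dropout is not None and not is_last:
            layer_output = [dropout(step) for step in layer_output]
        layer_input = layer_output
    return layer_input


def toy_engine(layer, reverse, seq):
    """Placeholder engine: emits the running sum of all features seen."""
    total = 0.0
    out = []
    for step in seq:
        total += sum(step)
        out.append([total])
    return out


demo = run_stack([[1.0], [2.0], [3.0]], num_layers=1, bidirectional=True,
                 run_direction=toy_engine)
```

Keeping the engine call this narrow means the layer loop, the direction loop, dropout, and concatenation all stay in one readable place, at the cost of one engine call per layer and direction.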
flatten_parameters is a no-op provided purely for API
compatibility with codebases that call it before the forward pass.
PackedSequence input is not yet supported. Use
lucid.nn.utils.rnn.pad_packed_sequence to unpack manually
before passing to this module.
See Also
- LSTMCell: Single time-step LSTM cell.
- GRU: Gated Recurrent Unit (simpler, no cell state).
- RNN: Vanilla Elman RNN.
Examples
Basic sequence encoding (batch-first convention):
>>> import lucid, lucid.nn as nn
>>> lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
>>> x = lucid.randn(2, 5, 10) # (N=2, L=5, I=10)
>>> output, (h_n, c_n) = lstm(x)
>>> output.shape, h_n.shape, c_n.shape
((2, 5, 20), (1, 2, 20), (1, 2, 20))
Bidirectional, 2-layer LSTM with dropout:
>>> lstm2 = nn.LSTM(
... input_size=16, hidden_size=32,
... num_layers=2, dropout=0.3,
... bidirectional=True, batch_first=True,
... )
>>> x2 = lucid.randn(4, 12, 16) # (N=4, L=12, I=16)
>>> out2, (h_n2, c_n2) = lstm2(x2)
>>> out2.shape # D*H = 2*32 = 64
(4, 12, 64)
>>> h_n2.shape # D*num_layers = 4
(4, 4, 32)
Methods (4)
__init__
__init__(input_size: int, hidden_size: int, num_layers: int = 1, bias: bool = True, batch_first: bool = False, dropout: float = 0.0, bidirectional: bool = False, proj_size: int = 0, device: DeviceLike = None, dtype: DTypeLike = None) → None
Initialise the LSTM module. See the class docstring for parameter semantics.
flatten_parameters
flatten_parameters() → None
No-op for API compatibility with reference recurrent modules.
Some external codepaths call flatten_parameters() to coalesce
weights for cuDNN; Lucid's BLAS / MLX path has no such concern,
so this is a placeholder that lets such code run unchanged.
forward
forward(x: Tensor, hx: tuple[Tensor, Tensor] | None = None) → tuple[Tensor, tuple[Tensor, Tensor]]
Multi-layer × bidirectional forward.
The C++ engine only handles a single layer in a single direction at a time, so this loops over layers and directions, applying inter-layer dropout and concatenating bidirectional outputs as the input to the next layer.
extra_repr
extra_repr() → str
Return a string representation of the layer's configuration.