class LSTM extends Module

LSTM(input_size: int, hidden_size: int, num_layers: int = 1, bias: bool = True, batch_first: bool = False, dropout: float = 0.0, bidirectional: bool = False, proj_size: int = 0, device: DeviceLike = None, dtype: DTypeLike = None)


Long Short-Term Memory (LSTM) recurrent layer.

Applies a multi-layer LSTM over an input sequence. At each time step $t$ the following gated update equations are evaluated:

$$
\begin{aligned}
i_t &= \sigma(W_{ii}\,x_t + b_{ii} + W_{hi}\,h_{t-1} + b_{hi}) \\
f_t &= \sigma(W_{if}\,x_t + b_{if} + W_{hf}\,h_{t-1} + b_{hf}) \\
g_t &= \tanh(W_{ig}\,x_t + b_{ig} + W_{hg}\,h_{t-1} + b_{hg}) \\
o_t &= \sigma(W_{io}\,x_t + b_{io} + W_{ho}\,h_{t-1} + b_{ho}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

where $\sigma$ is the sigmoid function, $\odot$ is the element-wise (Hadamard) product, $x_t$ is the input at step $t$, and $h_{t-1}$, $c_{t-1}$ are the hidden and cell states carried from the previous step.

The four gates have the following roles:

Input gate $i_t$: controls how much new information enters the cell state.
Forget gate $f_t$: controls how much of the previous cell state is retained.
Cell gate $g_t$: candidate new content to add to the cell state.
Output gate $o_t$: controls what portion of the cell state is exposed as the hidden state $h_t$.
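The update equations and gate roles above can be sketched for a single time step in plain NumPy. This is an illustrative reference only, not Lucid's engine code: the function name lstm_step and the helper sigmoid are hypothetical, and the gate rows are assumed to be stacked in the order [i, f, g, o], as described for the weight attributes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, w_ih, w_hh, b_ih, b_hh):
    """One LSTM time step for a batch (illustrative, hypothetical helper).

    x_t: (N, I); h_prev, c_prev: (N, H)
    w_ih: (4H, I); w_hh: (4H, H); b_ih, b_hh: (4H,)
    Gate rows are assumed to be stacked in the order [i, f, g, o].
    """
    H = h_prev.shape[1]
    # All four pre-activations at once: (N, 4H)
    gates = x_t @ w_ih.T + b_ih + h_prev @ w_hh.T + b_hh
    i = sigmoid(gates[:, 0 * H:1 * H])   # input gate
    f = sigmoid(gates[:, 1 * H:2 * H])   # forget gate
    g = np.tanh(gates[:, 2 * H:3 * H])   # cell gate (candidate content)
    o = sigmoid(gates[:, 3 * H:4 * H])   # output gate
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
N, I, H = 2, 3, 4
h_t, c_t = lstm_step(
    rng.standard_normal((N, I)), np.zeros((N, H)), np.zeros((N, H)),
    rng.standard_normal((4 * H, I)), rng.standard_normal((4 * H, H)),
    np.zeros(4 * H), np.zeros(4 * H),
)
print(h_t.shape, c_t.shape)  # (2, 4) (2, 4)
```

Computing the four gates with one stacked matrix product is why the weight attributes have a leading dimension of 4H.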

When num_layers > 1, the hidden-state output of layer $\ell$ becomes the input to layer $\ell + 1$. During training, inter-layer dropout (with probability dropout) is applied between every pair of adjacent layers.

When bidirectional=True, two independent LSTMs process the sequence in opposite directions. Their output hidden states are concatenated along the feature dimension at each time step, so the output feature size is 2 * hidden_size. The hidden/cell state tensors h_n and c_n pack both directions along their leading axis in the order [fwd_l0, rev_l0, fwd_l1, rev_l1, ...].
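As a sketch of how the packing order [fwd_l0, rev_l0, fwd_l1, rev_l1, ...] can be unpacked, the snippet below uses a NumPy stand-in for h_n, not a real Lucid tensor. Reshaping the leading axis to (num_layers, D) separates layers from directions; the variable names are illustrative.

```python
import numpy as np

num_layers, D, N, H = 2, 2, 4, 32

# Stand-in for h_n: slice k along the leading axis is filled with the value k,
# so the origin of each slice is easy to trace.
h_n = np.stack([np.full((N, H), k, dtype=np.float32) for k in range(num_layers * D)])

# The leading axis is layer-major: [fwd_l0, rev_l0, fwd_l1, rev_l1].
per_layer = h_n.reshape(num_layers, D, N, H)
fwd_last = per_layer[-1, 0]   # forward direction of the last layer (slice 2)
rev_last = per_layer[-1, 1]   # reverse direction of the last layer (slice 3)

# A common sequence summary: concatenate both directions of the top layer.
summary = np.concatenate([fwd_last, rev_last], axis=-1)
print(summary.shape)  # (4, 64)
```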

Parameters

input_size: int

Number of expected features in each input vector $x_t$.

hidden_size: int

Number of features in the hidden state $h_t$ (denoted $H$ below).

num_layers: int = 1

Number of stacked recurrent layers. Default: 1.

bias: bool = True

If False, all bias terms are treated as zero (the parameter tensors still exist for API compatibility, but are set to zero and contribute nothing numerically). Default: True.

batch_first: bool = False

If True, the expected input/output shapes are (N, L, input_size) / (N, L, H * D) rather than the default (L, N, input_size) / (L, N, H * D). Default: False.

dropout: float = 0.0

Dropout probability applied to the outputs of every LSTM layer except the last. 0.0 disables dropout. Default: 0.0.

bidirectional: bool = False

If True, use a bidirectional LSTM; this doubles the output feature size and the leading dimension of h_n / c_n. Default: False.

proj_size: int = 0

If > 0, adds a learnable linear projection of size proj_size after each LSTM cell output, reducing the recurrent hidden dimension fed to the next time step. Default: 0 (disabled).

device: DeviceLike = None

Device on which to allocate weight tensors.

dtype: DTypeLike = None

Data type for weight tensors.
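To make the proj_size option more concrete, the NumPy sketch below shows one plausible form of the projection step. It assumes the projection is a bias-free linear map applied to $o_t \odot \tanh(c_t)$; the name project_hidden and the matrix w_hr are hypothetical and are not taken from Lucid's source.

```python
import numpy as np

def project_hidden(o_t, c_t, w_hr):
    """Assumed projection step: h_t = (o_t * tanh(c_t)) @ w_hr.T

    o_t, c_t: (N, H); w_hr: (P, H) with P = proj_size. Returns (N, P).
    """
    return (o_t * np.tanh(c_t)) @ w_hr.T

N, H, P = 2, 8, 3
rng = np.random.default_rng(0)
h_t = project_hidden(
    rng.random((N, H)),              # stand-in for output-gate activations
    rng.standard_normal((N, H)),     # stand-in for the cell state
    rng.standard_normal((P, H)),     # hypothetical projection matrix
)
print(h_t.shape)  # (2, 3)
```

Under this assumption the recurrent state carries P features while the cell state keeps H, which is consistent with the second dimension of weight_hh_l0 shrinking to proj_size.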

Attributes

weight_ih_l0: Parameter, shape (4H, I)

Input–hidden weight matrix for layer 0, forward direction. Stacks the four gate weight matrices $[W_{ii}; W_{if}; W_{ig}; W_{io}]$. Here $I$ is input_size and $H$ is hidden_size.

weight_hh_l0: Parameter, shape (4H, H)

Hidden–hidden weight matrix for layer 0, forward direction. Stacks $[W_{hi}; W_{hf}; W_{hg}; W_{ho}]$. When proj_size > 0, the second dimension shrinks to proj_size (the recurrent state is the projected output).

bias_ih_l0: Parameter, shape (4H,)

Input–hidden bias for layer 0, forward direction. Present only when bias=True.

bias_hh_l0: Parameter, shape (4H,)

Hidden–hidden bias for layer 0, forward direction. Present only when bias=True.

For layer k in the forward direction, substitute l0 → l{k}; for the backward direction, append _reverse, e.g. weight_ih_l1_reverse.
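The naming scheme above can be enumerated with a few lines of plain Python. The helper lstm_param_names is hypothetical; it covers the default bias=True, proj_size=0 configuration, and the ordering of the returned list is illustrative rather than the module's actual registration order.

```python
def lstm_param_names(num_layers, bidirectional=False):
    """List parameter names following the l{k} / _reverse naming scheme.

    Hypothetical helper: assumes bias=True and proj_size=0.
    """
    kinds = ["weight_ih", "weight_hh", "bias_ih", "bias_hh"]
    suffixes = ["", "_reverse"] if bidirectional else [""]
    return [
        f"{kind}_l{layer}{suffix}"
        for layer in range(num_layers)
        for suffix in suffixes
        for kind in kinds
    ]

names = lstm_param_names(num_layers=2, bidirectional=True)
print(len(names))   # 16
print(names[:2])    # ['weight_ih_l0', 'weight_hh_l0']
print(names[-1])    # bias_hh_l1_reverse
```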

Notes

Input x: (L, N, input_size) or (N, L, input_size) when batch_first=True. L = sequence length, N = batch size.
h_0, c_0 (optional): (D * num_layers, N, H) where D = 2 if bidirectional else 1. When omitted they default to zero tensors.
output: (L, N, D * H) or (N, L, D * H) when batch_first=True.
h_n: (D * num_layers, N, H) — hidden state after the last time step for every layer and direction.
c_n: (D * num_layers, N, H) — cell state after the last time step for every layer and direction.
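The shape rules listed above can be collected in a small pure-Python helper, which is handy for checking a configuration before building a model. The function lstm_shapes is a hypothetical illustration and assumes proj_size = 0.

```python
def lstm_shapes(L, N, input_size, hidden_size, num_layers=1,
                bidirectional=False, batch_first=False):
    """Expected shapes of x, h_0/c_0, output and h_n/c_n (assumes proj_size = 0)."""
    D = 2 if bidirectional else 1
    # Only x and output change layout with batch_first; the state tensors do not.
    seq = (N, L) if batch_first else (L, N)
    state = (D * num_layers, N, hidden_size)
    return {
        "x": seq + (input_size,),
        "h_0": state,
        "c_0": state,
        "output": seq + (D * hidden_size,),
        "h_n": state,
        "c_n": state,
    }

shapes = lstm_shapes(L=12, N=4, input_size=16, hidden_size=32,
                     num_layers=2, bidirectional=True, batch_first=True)
print(shapes["output"], shapes["h_n"])  # (4, 12, 64) (4, 4, 32)
```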

All weight matrices are initialised with a uniform distribution $\mathcal{U}(-1/\sqrt{H},\, 1/\sqrt{H})$, matching the initialisation convention of the reference framework.
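A minimal NumPy sketch of this initialisation scheme follows. The helper init_uniform is hypothetical and only illustrates the bound $1/\sqrt{H}$; it does not reproduce Lucid's random number generation.

```python
import math
import numpy as np

def init_uniform(shape, hidden_size, rng):
    """Sample an array from U(-1/sqrt(H), 1/sqrt(H)) (illustrative helper)."""
    bound = 1.0 / math.sqrt(hidden_size)
    return rng.uniform(-bound, bound, size=shape)

H, I = 20, 10
rng = np.random.default_rng(0)
weight_ih_l0 = init_uniform((4 * H, I), H, rng)
weight_hh_l0 = init_uniform((4 * H, H), H, rng)

print(weight_ih_l0.shape, weight_hh_l0.shape)  # (80, 10) (80, 20)
```

Note that the bound depends on hidden_size only, so the input-hidden and hidden-hidden matrices share the same range even when I differs from H.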

The C++ engine processes a single layer and a single direction per call. Multi-layer and bidirectional configurations are composed entirely in Python: the module loops over layers and directions, applies inter-layer dropout, and concatenates bidirectional outputs.
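The composition described above can be sketched in NumPy as follows. Here run_direction is a hypothetical stand-in for the single-layer, single-direction engine call, and stacked_lstm mirrors the Python-side loop over layers and directions. The merged bias b (standing for b_ih + b_hh), the zero initial states, and the inverted-dropout scaling are simplifying assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def run_direction(x, w_ih, w_hh, b, reverse=False):
    """Hypothetical stand-in for the single-layer, single-direction engine call.

    x: (L, N, I). Returns per-step hidden states (L, N, H) and the final (h, c).
    """
    L, N, _ = x.shape
    H = w_hh.shape[1]
    h, c = np.zeros((N, H)), np.zeros((N, H))
    out = np.zeros((L, N, H))
    steps = range(L - 1, -1, -1) if reverse else range(L)
    for t in steps:
        gates = x[t] @ w_ih.T + h @ w_hh.T + b      # b stands for b_ih + b_hh
        i, f, g, o = np.split(gates, 4, axis=1)     # assumed order [i, f, g, o]
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        out[t] = h                                  # stored at its own time index
    return out, h, c

def stacked_lstm(x, params, num_layers, bidirectional,
                 dropout=0.0, training=False, rng=None):
    """Compose layers and directions; params[layer][direction] = (w_ih, w_hh, b)."""
    D = 2 if bidirectional else 1
    h_n, c_n = [], []
    layer_in = x
    for layer in range(num_layers):
        outs = []
        for d in range(D):
            out, h, c = run_direction(layer_in, *params[layer][d], reverse=(d == 1))
            outs.append(out)
            h_n.append(h)   # appended as fwd_l0, rev_l0, fwd_l1, rev_l1, ...
            c_n.append(c)
        # Concatenated directions become the input of the next layer.
        layer_in = np.concatenate(outs, axis=-1)
        # Inter-layer dropout: training only, never after the last layer.
        if training and dropout > 0.0 and layer < num_layers - 1:
            mask = rng.random(layer_in.shape) >= dropout
            layer_in = layer_in * mask / (1.0 - dropout)
    return layer_in, np.stack(h_n), np.stack(c_n)

rng = np.random.default_rng(0)
L, N, I, H, num_layers, D = 5, 2, 3, 4, 2, 2
params = []
for layer in range(num_layers):
    in_size = I if layer == 0 else D * H
    params.append([
        (rng.standard_normal((4 * H, in_size)) * 0.1,
         rng.standard_normal((4 * H, H)) * 0.1,
         np.zeros(4 * H))
        for _ in range(D)
    ])

output, h_n, c_n = stacked_lstm(rng.standard_normal((L, N, I)), params,
                                num_layers, bidirectional=True)
print(output.shape, h_n.shape, c_n.shape)  # (5, 2, 8) (4, 2, 4) (4, 2, 4)
```

In this sketch the forward half of output at the last time step equals the last layer's forward entry in h_n, while the reverse half matches its h_n entry at the first time step, because the reverse pass finishes at t = 0.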

flatten_parameters is a no-op provided purely for API compatibility with codebases that call it before the forward pass.

PackedSequence input is not yet supported. Use lucid.nn.utils.rnn.pad_packed_sequence to unpack manually before passing to this module.

See Also

LSTMCell : Single time-step LSTM cell.
GRU : Gated Recurrent Unit (simpler, no cell state).
RNN : Vanilla Elman RNN.

Examples

Basic sequence encoding (batch-first convention):
>>> import lucid, lucid.nn as nn
>>> lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
>>> x = lucid.randn(2, 5, 10)          # (N=2, L=5, I=10)
>>> output, (h_n, c_n) = lstm(x)
>>> output.shape, h_n.shape, c_n.shape
((2, 5, 20), (1, 2, 20), (1, 2, 20))

Bidirectional, 2-layer LSTM with dropout:
>>> lstm2 = nn.LSTM(
...     input_size=16, hidden_size=32,
...     num_layers=2, dropout=0.3,
...     bidirectional=True, batch_first=True,
... )
>>> x2 = lucid.randn(4, 12, 16)        # (N=4, L=12, I=16)
>>> out2, (h_n2, c_n2) = lstm2(x2)
>>> out2.shape    # D*H = 2*32 = 64
(4, 12, 64)
>>> h_n2.shape    # D*num_layers = 4
(4, 4, 32)

Methods (4)

__init__ → None

__init__(input_size: int, hidden_size: int, num_layers: int = 1, bias: bool = True, batch_first: bool = False, dropout: float = 0.0, bidirectional: bool = False, proj_size: int = 0, device: DeviceLike = None, dtype: DTypeLike = None)


Initialise the LSTM module. See the class docstring for parameter semantics.

flatten_parameters → None

flatten_parameters()


No-op for API compatibility with reference recurrent modules.

Some external codepaths call flatten_parameters() to coalesce weights for cuDNN; Lucid's BLAS / MLX path has no such concern, so this is a placeholder that lets such code run unchanged.

forward → tuple[Tensor, tuple[Tensor, Tensor]]

forward(x: Tensor, hx: tuple[Tensor, Tensor] | None = None)


Run the forward pass over every layer and direction.

The C++ engine only handles a single layer in a single direction at a time, so this loops over layers and directions, applying inter-layer dropout and concatenating bidirectional outputs as the input to the next layer.

extra_repr → str

extra_repr()


Return a string representation of the layer's configuration.