class TransformerDecoder extends Module

TransformerDecoder(decoder_layer: TransformerDecoderLayer, num_layers: int, norm: Module | None = None)


A stack of $N$ identical TransformerDecoderLayer modules.

The decoder takes a target sequence and a memory tensor produced by the encoder and generates a sequence of hidden representations that are conditioned on both. Every decoder layer receives the same memory as its cross-attention context, allowing the decoder to attend to all encoder positions at each decoding step.

Formally, letting $\text{Layer}_i$ denote the $i$-th decoder layer:

$$h^{(0)} = \text{tgt}$$

$$h^{(i)} = \text{Layer}_i\!\left(h^{(i-1)},\; \text{memory},\; \text{tgt\_mask},\; \text{memory\_mask}\right), \quad i = 1, \ldots, N$$

$$\text{output} = \begin{cases} \text{norm}\!\left(h^{(N)}\right) & \text{if norm is set} \\ h^{(N)} & \text{otherwise} \end{cases}$$
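
The recurrence above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not Lucid's implementation: `run_stack` and `toy_layer` are hypothetical names, and a layer is modelled as any callable that takes `(h, memory, tgt_mask, memory_mask)`.

```python
from typing import Callable, Optional, Sequence

# Hypothetical stand-in for a decoder layer: any callable with this signature.
Layer = Callable[[list, list, object, object], list]


def run_stack(
    layers: Sequence[Layer],
    tgt: list,
    memory: list,
    tgt_mask=None,
    memory_mask=None,
    norm: Optional[Callable[[list], list]] = None,
) -> list:
    """Apply the layers in order, feeding every layer the same memory."""
    h = tgt  # h^(0) = tgt
    for layer in layers:  # h^(i) = Layer_i(h^(i-1), memory, masks)
        h = layer(h, memory, tgt_mask, memory_mask)
    # The final norm is applied once, after the last layer, and only if set.
    return norm(h) if norm is not None else h


def toy_layer(h, memory, tgt_mask, memory_mask):
    """Toy layer: adds the mean of memory to every element of h."""
    shift = sum(memory) / len(memory)
    return [x + shift for x in h]


# The toy layer is stateless, so reusing the same callable three times is harmless.
out = run_stack([toy_layer] * 3, tgt=[0.0, 1.0], memory=[1.0, 3.0])
print(out)  # [6.0, 7.0]
```

Note that `memory` is passed unchanged to every layer, while `h` is the only value that evolves through the stack.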

Parameters

decoder_layer: TransformerDecoderLayer

A single configured decoder layer. Its hyperparameters are copied to construct num_layers independent instances. The provided layer itself is not reused.
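
The copy semantics described above can be illustrated with a toy class. This is a sketch only: `ToyLayer`, its `config` method, and `build_stack` are hypothetical names used for illustration and are not part of Lucid.

```python
import random


class ToyLayer:
    """Hypothetical stand-in for a decoder layer with its own weights."""

    def __init__(self, d_model: int, nhead: int) -> None:
        self.d_model = d_model
        self.nhead = nhead
        # Each instance draws its own parameters at construction time.
        self.weight = [random.random() for _ in range(d_model)]

    def config(self) -> dict:
        """Return the hyperparameters needed to build an equivalent layer."""
        return {"d_model": self.d_model, "nhead": self.nhead}


def build_stack(template: ToyLayer, num_layers: int) -> list:
    # Construct fresh layers from the template's hyperparameters; the
    # template object itself never ends up in the stack.
    return [type(template)(**template.config()) for _ in range(num_layers)]


template = ToyLayer(d_model=8, nhead=2)
layers = build_stack(template, num_layers=3)

print(len(layers))                                     # 3
print(all(layer is not template for layer in layers))  # True
print(layers[0].weight is layers[1].weight)            # False
```

Because every instance owns separate parameters, training updates to one layer do not affect the others or the template.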

num_layers: int

Number of decoder layers $N$ to stack.

norm: Module or None = None

Optional final normalisation applied to the last layer's output. Default: None.

Attributes

layers: list[TransformerDecoderLayer]

The $N$ decoder layer instances, registered as sub-modules "0", "1", …, "N-1".

norm: Module or None

The optional post-stack normalisation module.

num_layers: int

Number of stacked layers.

Notes

tgt: $(T, N, E)$ when batch_first=False, or $(N, T, E)$ when batch_first=True.
memory: $(S, N, E)$ when batch_first=False, or $(N, S, E)$ when batch_first=True.
Output: same shape as tgt.

Here $T$ is the target sequence length, $S$ is the source sequence length, $N$ is the batch size (not the layer count, which is also written $N$ elsewhere on this page), and $E$ = d_model.
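
As a quick check of the shape rules above, the following sketch computes the expected shapes for both layouts. `decoder_shapes` is a hypothetical helper written for illustration; it is not a Lucid function.

```python
def decoder_shapes(T: int, S: int, N: int, E: int, batch_first: bool = False):
    """Return (tgt_shape, memory_shape, output_shape) for the given sizes."""
    if batch_first:
        tgt, memory = (N, T, E), (N, S, E)
    else:
        tgt, memory = (T, N, E), (S, N, E)
    return tgt, memory, tgt  # the output has the same shape as tgt


print(decoder_shapes(T=15, S=30, N=2, E=512))
# ((15, 2, 512), (30, 2, 512), (15, 2, 512))
print(decoder_shapes(T=5, S=20, N=2, E=256, batch_first=True))
# ((2, 5, 256), (2, 20, 256), (2, 5, 256))
```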

TransformerDecoder is the decoder half of an encoder-decoder model. It suits sequence-to-sequence tasks (machine translation, summarisation, speech synthesis) and autoregressive language modelling in which an encoder processes the conditioning context.

For autoregressive generation at inference time, the decoder is invoked step-by-step: at each step, the tgt tensor grows by one token (the previously generated token is appended), and the memory tensor remains fixed as the encoded source. A causal mask passed through tgt_mask ensures each position only attends to previously generated tokens.
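
The step-by-step procedure described above can be sketched without any tensor library. Everything here is an assumption made for illustration: `causal_mask`, `toy_decoder`, and `greedy_decode` are hypothetical names, the decoder is replaced by a simple arithmetic stand-in, and the mask uses a boolean convention in which True means "blocked". The mask format Lucid actually expects (boolean or additive float) is not stated on this page and may differ.

```python
def causal_mask(size: int) -> list:
    """Boolean mask where True marks a position that must NOT be attended to.

    Assumption: True means "blocked"; the real convention may differ.
    """
    return [[key > query for key in range(size)] for query in range(size)]


def toy_decoder(tgt: list, memory: list, tgt_mask: list) -> list:
    """Hypothetical stand-in for decoder plus output head.

    For each position, sums the visible target tokens plus the memory total.
    """
    ctx = sum(memory)
    out = []
    for row in tgt_mask:
        visible = [tok for tok, blocked in zip(tgt, row) if not blocked]
        out.append((sum(visible) + ctx) % 10)
    return out


def greedy_decode(memory: list, start_token: int, max_len: int) -> list:
    tgt = [start_token]
    for _ in range(max_len - 1):
        mask = causal_mask(len(tgt))          # rebuilt as tgt grows
        out = toy_decoder(tgt, memory, mask)  # memory stays fixed
        tgt.append(out[-1])                   # append the newest prediction
    return tgt


for row in causal_mask(3):
    print(row)
# [False, True, True]
# [False, False, True]
# [False, False, False]

print(greedy_decode(memory=[1, 2], start_token=1, max_len=4))
# [1, 4, 8, 6]
```

Only the last output position is consumed at each step, which is why recomputing the full target sequence every step is wasteful and motivates the caching options on `forward`.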

Examples

**Six-layer decoder** (standard seq2seq decoder):
>>> import lucid
>>> import lucid.nn as nn
>>> d_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
>>> norm = nn.LayerNorm(512)
>>> decoder = nn.TransformerDecoder(d_layer, num_layers=6, norm=norm)
>>> tgt = lucid.randn(15, 2, 512)       # (tgt_len, batch, d_model)
>>> memory = lucid.randn(30, 2, 512)    # (src_len, batch, d_model)
>>> out = decoder(tgt, memory)
>>> out.shape
(15, 2, 512)

**Decoder for autoregressive generation** (batch_first):
>>> d_layer = nn.TransformerDecoderLayer(
...     d_model=256, nhead=4, batch_first=True
... )
>>> decoder = nn.TransformerDecoder(d_layer, num_layers=4)
>>> memory = lucid.randn(2, 20, 256)    # encoder output
>>> # At step t, tgt contains the t tokens generated so far
>>> tgt = lucid.randn(2, 5, 256)
>>> out = decoder(tgt, memory, tgt_mask=None)
>>> out.shape
(2, 5, 256)


Constructors

__init__ → None

__init__(decoder_layer: TransformerDecoderLayer, num_layers: int, norm: Module | None = None)


Initialise the TransformerDecoder module. See the class docstring for parameter semantics.

Instance methods

extra_repr → str

extra_repr()


Return a string representation of the layer's configuration.

forward → Tensor

forward(tgt: Tensor, memory: Tensor, tgt_mask: Tensor | None = None, memory_mask: Tensor | None = None, tgt_key_padding_mask: Tensor | None = None, memory_key_padding_mask: Tensor | None = None, past_key_value: Cache | None = None, use_cache: bool = False, cache_position: Tensor | None = None)


Run the forward pass of the module.

Parameters

tgt: Tensor

See the class docstring.

memory: Tensor

See the class docstring.

tgt_mask: Tensor or None = None

See the class docstring.

memory_mask: Tensor or None = None

See the class docstring.

tgt_key_padding_mask: Tensor or None = None

See the class docstring.

memory_key_padding_mask: Tensor or None = None

See the class docstring.

past_key_value: Cache or None, optional, keyword-only = None

Shared encoder-decoder cache threaded to every layer (each layer writes its own layer_idx slot).

use_cache: bool, optional, keyword-only = False

Enable incremental KV caching.

cache_position: Tensor or None, optional, keyword-only = None

Absolute positions of the new tgt tokens; accepted for parity.

Returns

Tensor

Output tensor; refer to the class docstring for the exact shape.
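
To make the caching parameters more concrete, here is a toy sketch of a single cache object threaded through every layer, with each layer writing to the slot for its own `layer_idx`. It is an assumption-laden illustration, not Lucid's `Cache` class: `ToyCache`, `update`, `seq_len`, and `toy_layer_step` are hypothetical names, and only the self-attention side of the cache is modelled.

```python
class ToyCache:
    """Hypothetical cache: one growing key/value list per layer index."""

    def __init__(self) -> None:
        self.slots = {}

    def update(self, layer_idx: int, key, value):
        # Each layer appends only to its own slot and gets back the full history.
        keys, values = self.slots.setdefault(layer_idx, ([], []))
        keys.append(key)
        values.append(value)
        return keys, values

    def seq_len(self, layer_idx: int = 0) -> int:
        return len(self.slots.get(layer_idx, ([], []))[0])


def toy_layer_step(layer_idx: int, x: float, cache: ToyCache) -> float:
    # Only the newest token is projected; earlier keys/values come from the cache.
    keys, values = cache.update(layer_idx, key=x, value=x * 2)
    return sum(values) / len(keys)  # stand-in for attention over the history


cache = ToyCache()
num_layers = 2
for token in [1.0, 2.0, 3.0]:      # one new target token per step
    h = token
    for idx in range(num_layers):  # the same cache object is threaded through
        h = toy_layer_step(idx, h, cache)

print(cache.seq_len(0), cache.seq_len(1))  # 3 3
print(sorted(cache.slots))                 # [0, 1]
```

With a cache like this, each decoding step processes a single new token per layer instead of the whole target prefix.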
