class TransformerDecoder extends Module
TransformerDecoder(decoder_layer: TransformerDecoderLayer, num_layers: int, norm: Module | None = None)

A stack of $N$ identical TransformerDecoderLayer modules.

The decoder takes a target sequence and a memory tensor produced by the encoder and generates a sequence of hidden representations that are conditioned on both. Every decoder layer receives the same memory as its cross-attention context, allowing the decoder to attend to all encoder positions at each decoding step.

Formally, letting $\text{Layer}_i$ denote the $i$-th decoder layer:

$$h^{(0)} = \text{tgt}$$

$$h^{(i)} = \text{Layer}_i\!\left(h^{(i-1)},\; \text{memory},\; \text{tgt\_mask},\; \text{memory\_mask}\right), \quad i = 1, \ldots, N$$

$$\text{output} = \begin{cases} \text{LayerNorm}\!\left(h^{(N)}\right) & \text{if norm is set} \\ h^{(N)} & \text{otherwise} \end{cases}$$
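
The recurrence can be sketched in plain Python. This is a minimal illustration, not lucid's implementation: `run_decoder_stack` and the toy layers are hypothetical names, and the key-padding masks are left out because the formula does not include them.

```python
from typing import Any, Callable, Optional, Sequence

# Hypothetical layer signature: (h, memory, tgt_mask, memory_mask) -> h
Layer = Callable[[Any, Any, Any, Any], Any]


def run_decoder_stack(
    layers: Sequence[Layer],
    tgt: Any,
    memory: Any,
    tgt_mask: Any = None,
    memory_mask: Any = None,
    norm: Optional[Callable[[Any], Any]] = None,
) -> Any:
    """Apply each layer in order, then the optional final norm."""
    h = tgt  # h^(0) = tgt
    for layer in layers:
        # Every layer receives the same memory and the same masks.
        h = layer(h, memory, tgt_mask, memory_mask)
    return norm(h) if norm is not None else h


# Toy stand-in layers that add the memory value to the hidden state.
toy_layers = [lambda h, m, tm, mm: h + m for _ in range(3)]
result = run_decoder_stack(toy_layers, tgt=1.0, memory=0.5)
result_normed = run_decoder_stack(
    toy_layers, tgt=1.0, memory=0.5, norm=lambda h: h * 2
)
```

With three toy layers, `result` is the target plus three copies of the memory value, and `result_normed` applies the stand-in norm on top of that.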

Parameters

decoder_layer: TransformerDecoderLayer
A single configured decoder layer. Its hyperparameters are copied to construct num_layers independent instances. The provided layer itself is not reused.
num_layers: int
Number of decoder layers $N$ to stack.
norm: Module | None = None
Optional final normalisation applied to the last layer's output. Default: None.
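
One way to picture the "copied, not reused" behaviour of decoder_layer is a stack of fresh instances that share the template's hyperparameters. The sketch below is only an illustration: `ToyLayer` and `build_stack` are hypothetical, and how lucid actually constructs the copies is not stated here.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyLayer:
    """Hypothetical stand-in for a decoder layer: hyperparameters plus weights."""

    d_model: int
    nhead: int
    weights: list = field(default_factory=list)

    def __post_init__(self) -> None:
        # Each instance draws its own weights.
        self.weights = [random.random() for _ in range(self.d_model)]


def build_stack(template: ToyLayer, num_layers: int) -> list:
    # Read the hyperparameters off the template and construct new instances;
    # the template itself never enters the stack.
    return [ToyLayer(template.d_model, template.nhead) for _ in range(num_layers)]


template = ToyLayer(d_model=4, nhead=2)
stack = build_stack(template, num_layers=3)
```

Because every entry is a separate object, updating one layer's weights leaves the others, and the template, untouched.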

Attributes

layers: list[TransformerDecoderLayer]
The $N$ decoder layer instances, registered as sub-modules "0", "1", …, "N-1".
norm: Module | None
The optional post-stack normalisation module.
num_layers: int
Number of stacked layers.

Notes

  • tgt: $(T, N, E)$ when batch_first=False, or $(N, T, E)$ when batch_first=True.
  • memory: $(S, N, E)$ when batch_first=False, or $(N, S, E)$ when batch_first=True.
  • Output: same shape as tgt.

where $T$ is the target sequence length, $S$ is the source sequence length, $N$ is the batch size (in these shapes only; elsewhere $N$ is the number of layers), and $E$ = d_model.
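
The shape rules can be restated as a small lookup. `expected_shapes` is a hypothetical helper written for this note and is not part of lucid.

```python
def expected_shapes(
    T: int, S: int, N: int, E: int, batch_first: bool = False
) -> dict:
    """Return the tgt, memory, and output shapes for the given layout."""
    if batch_first:
        tgt, memory = (N, T, E), (N, S, E)
    else:
        tgt, memory = (T, N, E), (S, N, E)
    # The output always has the same shape as tgt.
    return {"tgt": tgt, "memory": memory, "output": tgt}


seq_first = expected_shapes(T=15, S=30, N=2, E=512)
batch_first = expected_shapes(T=5, S=20, N=2, E=256, batch_first=True)
```

The two calls use the same dimensions as the examples in this docstring.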

The TransformerDecoder is the natural building block for sequence-to-sequence tasks (machine translation, summarisation, speech synthesis) and for autoregressive language modelling when combined with an encoder that processes the conditioning context.

For autoregressive generation at inference time, the decoder is invoked step-by-step: at each step, the tgt tensor grows by one token (the previously generated token is appended), and the memory tensor remains fixed as the encoded source. A causal mask passed through tgt_mask ensures each position only attends to previously generated tokens.
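
The structure of a causal mask can be shown without any tensor library. The sketch below builds it as nested Python lists, using the assumed convention that True means "may attend". Whether lucid expects a boolean or an additive mask for tgt_mask, and which value marks a blocked position, is not stated here, so check before converting this pattern to a tensor.

```python
def causal_mask(size: int) -> list:
    """Square mask: entry [i][j] is True when position i may attend to j."""
    return [[j <= i for j in range(size)] for i in range(size)]


mask = causal_mask(4)
for row in mask:
    # "x" marks an allowed position, "." a blocked one.
    print(" ".join("x" if allowed else "." for allowed in row))
```

The printed pattern is lower-triangular: row i allows positions 0 through i and blocks every later position.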

Examples

**Six-layer decoder** (standard seq2seq decoder):
>>> import lucid
>>> import lucid.nn as nn
>>> d_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
>>> norm = nn.LayerNorm(512)
>>> decoder = nn.TransformerDecoder(d_layer, num_layers=6, norm=norm)
>>> tgt = lucid.randn(15, 2, 512)       # (tgt_len, batch, d_model)
>>> memory = lucid.randn(30, 2, 512)    # (src_len, batch, d_model)
>>> out = decoder(tgt, memory)
>>> out.shape
(15, 2, 512)

**Decoder for autoregressive generation** (batch_first):
>>> d_layer = nn.TransformerDecoderLayer(
...     d_model=256, nhead=4, batch_first=True
... )
>>> decoder = nn.TransformerDecoder(d_layer, num_layers=4)
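>>> memory = lucid.randn(2, 20, 256)    # encoder output: (batch, src_len, d_model)
>>> # At step t, tgt contains the t tokens generated so far (here t = 5)
>>> tgt = lucid.randn(2, 5, 256)        # (batch, tgt_len, d_model)
>>> # tgt_mask=None applies no mask; pass a causal mask to restrict
>>> # each position to previously generated tokens
>>> out = decoder(tgt, memory, tgt_mask=None)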
>>> out.shape
(2, 5, 256)
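
The step-by-step procedure described in the class notes (tgt grows by one token per step, memory stays fixed) can be outlined as a loop. This is a sketch under stated assumptions: `greedy_decode` and `copy_step` are hypothetical names, tokens are plain integers, and `step_fn` stands in for the decoder plus an output head that picks the next token, which this class does not provide.

```python
from typing import Callable, List


def greedy_decode(
    step_fn: Callable[[List[int], List[int]], int],
    memory: List[int],
    bos: int,
    eos: int,
    max_len: int,
) -> List[int]:
    """Grow tgt one token at a time while memory stays fixed."""
    tgt = [bos]
    for _ in range(max_len):
        next_token = step_fn(tgt, memory)  # stand-in for decoder + output head
        tgt.append(next_token)
        if next_token == eos:
            break
    return tgt


def copy_step(tgt: List[int], memory: List[int]) -> int:
    """Toy step: echo the source tokens in order, then emit EOS (0)."""
    pos = len(tgt) - 1  # tokens generated so far, excluding BOS
    return memory[pos] if pos < len(memory) else 0


generated = greedy_decode(copy_step, memory=[7, 8, 9], bos=1, eos=0, max_len=10)
```

In a real model the whole tgt prefix is passed to the decoder at every step, which is why the causal mask matters even at inference time.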

Methods (3)

dunder

__init__

None
__init__(decoder_layer: TransformerDecoderLayer, num_layers: int, norm: Module | None = None)

Initialise the TransformerDecoder module. See the class docstring for parameter semantics.

fn

forward

Tensor
forward(tgt: Tensor, memory: Tensor, tgt_mask: Tensor | None = None, memory_mask: Tensor | None = None, tgt_key_padding_mask: Tensor | None = None, memory_key_padding_mask: Tensor | None = None)

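Run the forward pass: tgt is passed through each decoder layer in turn, with memory as the cross-attention context, and the final norm is applied if one is set.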

Parameters

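tgt: Tensor
Target sequence. See the class docstring for the shape.
memory: Tensor
Encoder output, used as the cross-attention context by every layer. See the class docstring for the shape.
tgt_mask: Tensor | None = None
Optional mask over target positions, such as a causal mask for autoregressive decoding.
memory_mask: Tensor | None = None
Optional mask over memory positions, passed unchanged to every layer.
tgt_key_padding_mask: Tensor | None = None
Optional padding mask for tgt.
memory_key_padding_mask: Tensor | None = None
Optional padding mask for memory.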

Returns

Tensor

Output tensor with the same shape as tgt.
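
Key-padding masks are usually derived from per-sequence lengths. The sketch below shows that derivation with plain lists; `padding_mask` is a hypothetical helper, and the convention that True marks a padded position is an assumption, since this page does not document what lucid expects.

```python
def padding_mask(lengths: list, max_len: int) -> list:
    """Shape (batch, max_len); True marks padded positions (assumed convention)."""
    return [[pos >= length for pos in range(max_len)] for length in lengths]


# Two sequences padded to length 5: the first has 3 real tokens, the second 5.
pad_mask = padding_mask([3, 5], max_len=5)
```

The same pattern applies to both tgt_key_padding_mask (using target lengths) and memory_key_padding_mask (using source lengths).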

fn

extra_repr

str
extra_repr()

Return a string representation of the layer's configuration.