TransformerDecoder
Module
TransformerDecoder(decoder_layer: TransformerDecoderLayer, num_layers: int, norm: Module | None = None)
A stack of identical TransformerDecoderLayer modules.
The decoder takes a target sequence and a memory tensor produced
by the encoder and generates a sequence of hidden representations
that are conditioned on both. Every decoder layer receives the
same memory as its cross-attention context, allowing the
decoder to attend to all encoder positions at each decoding step.
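Formally, letting $\mathrm{Layer}_i$ denote the $i$-th decoder layer and $N$ = num_layers:

$$x_0 = \mathrm{tgt}, \qquad x_i = \mathrm{Layer}_i(x_{i-1}, \mathrm{memory}), \quad i = 1, \dots, N$$

The output is $x_N$, with the optional norm, if given, applied to it.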
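Parameters
decoder_layer : TransformerDecoderLayer
    Template layer from which num_layers independent instances are created.
    The provided layer itself is not reused.
num_layers : int
    Number of decoder layers in the stack.
norm : Module or None, default None
    Optional normalisation module. Defaults to None.
Attributes
layers : list[TransformerDecoderLayer]
    The stacked decoder layers, indexed "0", "1", …, "N-1".
norm : Module or None
num_layers : int
Notes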
- tgt: $(T, B, E)$ when batch_first=False, or $(B, T, E)$ when batch_first=True.
- memory: $(S, B, E)$ when batch_first=False, or $(B, S, E)$ when batch_first=True.
- Output: same shape as tgt.
where $T$ is the target sequence length, $S$ is the
source sequence length, $B$ is the batch size, and
$E$ = d_model.
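The stacking behaviour described in the class summary can be illustrated with a minimal pure-Python sketch. It does not use lucid: ToyLayer and ToyDecoderStack are hypothetical stand-ins, and applying norm after the last layer is an assumption rather than something this page states. The sketch only shows the template layer being cloned into independent instances and the same memory being fed to every layer.

```python
import copy


class ToyLayer:
    """Hypothetical stand-in for a decoder layer (not the lucid class).

    It adds scale * mean(memory) to every element of x.
    """

    def __init__(self, scale):
        self.scale = scale

    def __call__(self, x, memory):
        context = sum(memory) / len(memory)
        return [v + self.scale * context for v in x]


class ToyDecoderStack:
    """Hypothetical stack mirroring the documented constructor arguments."""

    def __init__(self, decoder_layer, num_layers, norm=None):
        # Clone the template so every layer owns its own state;
        # the object passed in is never stored in self.layers.
        self.layers = [copy.deepcopy(decoder_layer) for _ in range(num_layers)]
        self.norm = norm

    def __call__(self, tgt, memory):
        x = tgt
        for layer in self.layers:
            # Every layer sees the same, unchanged memory.
            x = layer(x, memory)
        # Assumption: the optional norm is applied once, after the last layer.
        if self.norm is not None:
            x = self.norm(x)
        return x


template = ToyLayer(scale=1.0)
stack = ToyDecoderStack(template, num_layers=3)
out = stack([0.0, 1.0], memory=[2.0, 4.0])

# Changing the template afterwards does not affect the cloned layers.
template.scale = 0.0
out_again = stack([0.0, 1.0], memory=[2.0, 4.0])
```

Because each layer is a deep copy, later changes to the template leave the stack untouched, which matches the statement that the provided layer itself is not reused.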
The TransformerDecoder is the natural building block for
sequence-to-sequence tasks (machine translation, summarisation,
speech synthesis) and for autoregressive language modelling when
combined with an encoder that processes the conditioning context.
For autoregressive generation at inference time, the decoder is
invoked step-by-step: at each step, the tgt tensor grows by
one token (the previously generated token is appended), and the
memory tensor remains fixed as the encoded source. A causal
mask passed through tgt_mask ensures each position only
attends to previously generated tokens.
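The step-by-step procedure above can be sketched without lucid as follows. Everything here is hypothetical: toy_decoder is a stand-in that works on plain lists of numbers, and the mask convention (True means "may attend") is an assumption, since libraries differ on whether a mask marks allowed or blocked positions or uses additive values. Check the lucid documentation for the form tgt_mask actually expects.

```python
def causal_mask(length):
    """Return a length x length mask.

    Assumed convention: mask[i][j] is True when position i may attend
    to position j, that is, when j <= i.
    """
    return [[j <= i for j in range(length)] for i in range(length)]


def toy_decoder(tgt, memory, tgt_mask):
    """Hypothetical decoder: each output position is the sum of the
    target values it may attend to, plus the sum of memory."""
    bias = sum(memory)
    return [
        sum(value for value, allowed in zip(tgt, row) if allowed) + bias
        for row in tgt_mask
    ]


def generate(memory, start_token, steps):
    """Grow tgt by one element per step while memory stays fixed."""
    tgt = [start_token]
    for _ in range(steps):
        out = toy_decoder(tgt, memory, causal_mask(len(tgt)))
        # The prediction at the last position becomes the next input.
        tgt.append(out[-1])
    return tgt


mask = causal_mask(3)
tokens = generate(memory=[1, 2], start_token=0, steps=3)
```

The mask is rebuilt at each step because its size must match the current length of tgt, while memory is computed once and reused.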
Examples
**Six-layer decoder** (standard seq2seq decoder):
>>> import lucid
>>> import lucid.nn as nn
>>> d_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
>>> norm = nn.LayerNorm(512)
>>> decoder = nn.TransformerDecoder(d_layer, num_layers=6, norm=norm)
>>> tgt = lucid.randn(15, 2, 512) # (tgt_len, batch, d_model)
>>> memory = lucid.randn(30, 2, 512) # (src_len, batch, d_model)
>>> out = decoder(tgt, memory)
>>> out.shape
(15, 2, 512)
**Decoder for autoregressive generation** (batch_first):
>>> d_layer = nn.TransformerDecoderLayer(
... d_model=256, nhead=4, batch_first=True
... )
>>> decoder = nn.TransformerDecoder(d_layer, num_layers=4)
>>> memory = lucid.randn(2, 20, 256) # encoder output
>>> # At step t, tgt contains the t tokens generated so far
>>> tgt = lucid.randn(2, 5, 256)
>>> out = decoder(tgt, memory, tgt_mask=None)
>>> out.shape
(2, 5, 256)
Methods (3)
__init__
__init__(decoder_layer: TransformerDecoderLayer, num_layers: int, norm: Module | None = None) → None
Initialise the TransformerDecoder module. See the class docstring for parameter semantics.
forward
forward(tgt: Tensor, memory: Tensor, tgt_mask: Tensor | None = None, memory_mask: Tensor | None = None, tgt_key_padding_mask: Tensor | None = None, memory_key_padding_mask: Tensor | None = None) → Tensor
Run the forward pass of the module.
Parameters
tgt : Tensor
memory : Tensor
tgt_mask : Tensor, default None
memory_mask : Tensor, default None
tgt_key_padding_mask : Tensor, default None
memory_key_padding_mask : Tensor, default None
Returns
Tensor
    Output tensor; refer to the class docstring for the exact shape.
extra_repr
extra_repr() → str
Return a string representation of the layer's configuration.