class

TransformerDecoderLayer

extends Module
TransformerDecoderLayer(d_model: int, nhead: int, dim_feedforward: int = 2048, dropout: float = 0.1, activation: str = 'relu', batch_first: bool = False, norm_first: bool = False, device: DeviceLike = None, dtype: DTypeLike = None)

Single transformer decoder layer: masked self-attention, cross-attention over encoder memory, and a position-wise FFN — each with residual connections and layer normalisation.

The decoder layer introduces a second attention sub-layer that attends to the encoder output (memory), connecting the generation process to the source context. The three sub-layers and their residual connections are:

Post-LN (default, norm_first=False):

$$
\begin{aligned}
x &= \text{LayerNorm}\!\left( x + \text{Dropout}(\text{SelfAttn}(x,\, x,\, x)) \right) && \text{[masked / causal self-attention]} \\
x &= \text{LayerNorm}\!\left( x + \text{Dropout}(\text{CrossAttn}(x,\, m,\, m)) \right) && \text{[cross-attention to encoder memory } m \text{]} \\
x &= \text{LayerNorm}\!\left( x + \text{Dropout}(\text{FFN}(x)) \right)
\end{aligned}
$$

Pre-LN (norm_first=True):

$$
\begin{aligned}
x &= x + \text{Dropout}(\text{SelfAttn}(\text{LayerNorm}(x))) \\
x &= x + \text{Dropout}(\text{CrossAttn}(\text{LayerNorm}(x),\, m,\, m)) \\
x &= x + \text{Dropout}(\text{FFN}(\text{LayerNorm}(x)))
\end{aligned}
$$

where $m$ is the memory tensor produced by the encoder.
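The difference between the two orderings can be sketched in plain Python. This is only an illustration under stated assumptions: `layer_norm` has no learned scale or shift, dropout is treated as the identity (as in evaluation mode), and `toy_sublayer` is a hypothetical stand-in for any of the three sub-layers. None of these helpers are part of the library.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalise a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(a, b):
    """Element-wise sum of two equally sized vectors (the residual connection)."""
    return [u + v for u, v in zip(a, b)]

def post_ln_step(x, sublayer):
    # Post-LN: x = LayerNorm(x + Sublayer(x)); dropout omitted (identity in eval mode).
    return layer_norm(add(x, sublayer(x)))

def pre_ln_step(x, sublayer):
    # Pre-LN: x = x + Sublayer(LayerNorm(x)); the residual path is never normalised.
    return add(x, sublayer(layer_norm(x)))

def toy_sublayer(x):
    # Hypothetical stand-in for self-attention, cross-attention, or the FFN.
    return [2.0 * v for v in x]

x = [1.0, 2.0, 3.0, 4.0]
post = post_ln_step(x, toy_sublayer)  # output is normalised
pre = pre_ln_step(x, toy_sublayer)    # output keeps the scale of the residual stream
```

In the Post-LN step the output is always normalised, whereas in the Pre-LN step the unnormalised input passes straight through the residual path.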

Feed-forward network (FFN):

$$
\text{FFN}(x) = \text{Linear}_2\!\left( \text{Dropout}(\sigma(\text{Linear}_1(x))) \right)
$$
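As a rough illustration of the formula above, the following sketch applies the FFN to a single feature vector. It assumes ReLU as the activation, treats dropout as the identity (evaluation mode), and uses tiny hand-picked weights with d_model = 2 and dim_feedforward = 3. The helper names are hypothetical and not taken from the library.

```python
def linear(x, weight, bias):
    """Compute W x + b for one feature vector; weight has shape (out, in)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in x]

def ffn(x, w1, b1, w2, b2, activation=relu):
    # Linear_2(Dropout(sigma(Linear_1(x)))), with dropout as the identity.
    hidden = activation(linear(x, w1, b1))
    return linear(hidden, w2, b2)

# Toy weights: d_model = 2, dim_feedforward = 3.
w1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # shape (3, 2)
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 1.0, 1.0], [1.0, -1.0, 0.0]]     # shape (2, 3)
b2 = [0.5, 0.0]

y = ffn([2.0, -3.0], w1, b1, w2, b2)
print(y)  # [3.5, 2.0]
```

The hidden vector is `[2.0, 0.0, 1.0]` after ReLU, and the second linear map brings it back to width d_model.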

Parameters

d_model: int
Dimensionality of the model's hidden representations.
nhead: int
Number of attention heads in both the self-attention and the cross-attention sub-layers.
dim_feedforward: int = 2048
Inner width of the FFN. Default: 2048.
dropout: float = 0.1
Dropout probability for all sub-layer outputs and the FFN activation. Default: 0.1.
activation: str = 'relu'
Non-linearity inside the FFN: "relu" (default) or "gelu".
batch_first: bool = False
If True, all tensors use (batch, seq, feature) layout. Default: False (sequence-first).
norm_first: bool = False
If True, uses Pre-LN order. Default: False (Post-LN).
device: DeviceLike = None
Device for all sub-module parameters.
dtype: DTypeLike = None
Data type for all sub-module parameters.

Attributes

self_attn: MultiheadAttention
Masked (causal) self-attention over the target sequence.
multihead_attn: MultiheadAttention
Cross-attention: queries from target, keys/values from memory.
linear1: Linear
First FFN layer: d_model → dim_feedforward.
linear2: Linear
Second FFN layer: dim_feedforward → d_model.
norm1: LayerNorm
Normalisation for the self-attention sub-layer.
norm2: LayerNorm
Normalisation for the cross-attention sub-layer.
norm3: LayerNorm
Normalisation for the FFN sub-layer.
dropout1: Dropout
Dropout after the self-attention output.
dropout2: Dropout
Dropout inside the FFN (after activation).
dropout3: Dropout
Dropout after the cross-attention output.
dropout4: Dropout
Dropout after the FFN output.
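To give a feel for how the two FFN layers scale with the constructor arguments, here is a small back-of-the-envelope count. It assumes that both Linear layers carry a bias term, which this page does not state, and `ffn_parameter_count` is a hypothetical helper rather than a library function.

```python
def ffn_parameter_count(d_model, dim_feedforward=2048):
    """Weights plus biases of linear1 and linear2 (assumes both layers have a bias)."""
    linear1 = d_model * dim_feedforward + dim_feedforward  # d_model -> dim_feedforward
    linear2 = dim_feedforward * d_model + d_model          # dim_feedforward -> d_model
    return linear1 + linear2

print(ffn_parameter_count(512))  # 2099712
```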

Notes

  • tgt: $(T, N, E)$ when batch_first=False, or $(N, T, E)$ when batch_first=True.
  • memory: $(S, N, E)$ when batch_first=False, or $(N, S, E)$ when batch_first=True.
  • Output: same shape as tgt.

where $T$ is the target sequence length, $S$ is the source sequence length, $N$ is the batch size, and $E$ = d_model.
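The layout rules above can be written down as a small shape calculator. `decoder_layer_shapes` is a hypothetical helper for illustration only. It does not call the library and simply restates the shape conventions listed in these notes.

```python
def decoder_layer_shapes(tgt_len, src_len, batch, d_model, batch_first=False):
    """Return the expected tgt, memory, and output shapes for the given layout."""
    if batch_first:
        tgt = (batch, tgt_len, d_model)     # (N, T, E)
        memory = (batch, src_len, d_model)  # (N, S, E)
    else:
        tgt = (tgt_len, batch, d_model)     # (T, N, E)
        memory = (src_len, batch, d_model)  # (S, N, E)
    # The output always has the same shape as tgt.
    return {"tgt": tgt, "memory": memory, "output": tgt}

print(decoder_layer_shapes(10, 20, 2, 512)["output"])                   # (10, 2, 512)
print(decoder_layer_shapes(8, 12, 2, 256, batch_first=True)["memory"])  # (2, 12, 256)
```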

The self-attention sub-layer is typically used with tgt_is_causal=True for autoregressive decoding, so that position $i$ cannot attend to any future position $j > i$. During teacher-forced training, where the whole target sequence is fed in at once, this prevents information from future tokens leaking into earlier positions.
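The causal constraint can be pictured as a square boolean matrix over target positions. The sketch below assumes the convention that True marks a blocked position. The mask representation the library expects for tgt_mask may differ (for example an additive mask), so treat `causal_mask` as a hypothetical illustration of the rule $j > i$ only.

```python
def causal_mask(size):
    """mask[i][j] is True where position i must NOT attend to position j (j > i)."""
    return [[j > i for j in range(size)] for i in range(size)]

for row in causal_mask(3):
    print(row)
# [False, True, True]
# [False, False, True]
# [False, False, False]
```

Row $i$ leaves positions $0 \dots i$ visible, so the first target token sees only itself and the last one sees the full prefix.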

The cross-attention sub-layer uses the encoder output memory as both keys and values, while the decoder's current hidden state provides the queries. This is the mechanism by which the decoder conditions its generation on the source sequence.
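A stripped-down version of this mechanism is sketched below: single-head scaled dot-product attention in which each target vector is a query and the memory rows serve as both keys and values. It deliberately omits the learned projections, multiple heads, masking, and batching that a real MultiheadAttention module would apply, and `cross_attention` is a hypothetical helper, not the library's implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, memory):
    """Single-head attention: queries from the target, keys and values from memory."""
    dim = len(memory[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        # One score per memory position: scaled dot product of query and key.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in memory]
        weights = softmax(scores)
        # Weighted sum of the memory rows, used here as values.
        out.append([sum(w * v[d] for w, v in zip(weights, memory)) for d in range(dim)])
    return out

tgt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # T = 3 target positions
memory = [[10.0, 0.0], [0.0, 10.0]]         # S = 2 source positions
out = cross_attention(tgt, memory)          # T rows, each of width E = 2
```

The output has one row per target position regardless of the source length, which matches the note that the layer's output has the same shape as tgt.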

Examples

**Basic decoder layer** (sequence-first layout):
>>> import lucid
>>> import lucid.nn as nn
>>> layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
>>> tgt = lucid.randn(10, 2, 512)       # (tgt_len, batch, d_model)
>>> memory = lucid.randn(20, 2, 512)    # (src_len, batch, d_model)
>>> out = layer(tgt, memory)
>>> out.shape
(10, 2, 512)
**Pre-LN decoder layer with causal self-attention**:
>>> layer = nn.TransformerDecoderLayer(
...     d_model=256, nhead=4, activation="gelu",
...     batch_first=True, norm_first=True,
... )
>>> tgt = lucid.randn(2, 8, 256)
>>> memory = lucid.randn(2, 12, 256)
>>> out = layer(tgt, memory, tgt_is_causal=True)
>>> out.shape
(2, 8, 256)

Methods (3)

dunder

__init__

None
__init__(d_model: int, nhead: int, dim_feedforward: int = 2048, dropout: float = 0.1, activation: str = 'relu', batch_first: bool = False, norm_first: bool = False, device: DeviceLike = None, dtype: DTypeLike = None)

Initialise the TransformerDecoderLayer module. See the class docstring for parameter semantics.

fn

forward

Tensor
forward(tgt: Tensor, memory: Tensor, tgt_mask: Tensor | None = None, memory_mask: Tensor | None = None, tgt_key_padding_mask: Tensor | None = None, memory_key_padding_mask: Tensor | None = None, tgt_is_causal: bool = False, memory_is_causal: bool = False)

Run the forward pass of the module.

Parameters

tgt: Tensor
See the class docstring.
memory: Tensor
See the class docstring.
tgt_mask: Tensor | None = None
See the class docstring.
memory_mask: Tensor | None = None
See the class docstring.
tgt_key_padding_mask: Tensor | None = None
See the class docstring.
memory_key_padding_mask: Tensor | None = None
See the class docstring.
tgt_is_causal: bool = False
See the class docstring.
memory_is_causal: bool = False
See the class docstring.

Returns

Tensor

Output tensor with the same shape as tgt; see the Notes in the class docstring.

fn

extra_repr

str
extra_repr()

Return a string representation of the layer's configuration.