class TransformerDecoderLayer

extends Module

TransformerDecoderLayer(d_model: int, nhead: int, dim_feedforward: int = 2048, dropout: float = 0.1, activation: str = 'relu', batch_first: bool = False, norm_first: bool = False, device: DeviceLike = None, dtype: DTypeLike = None)

Single transformer decoder layer: masked self-attention, cross-attention over encoder memory, and a position-wise FFN — each with residual connections and layer normalisation.

The decoder layer introduces a second attention sub-layer that attends to the encoder output (memory), connecting the generation process to the source context. The three sub-layers and their residual connections are:

Post-LN (default, norm_first=False):

$$x = \text{LayerNorm}\!\left( x + \text{Dropout}(\text{SelfAttn}(x,\, x,\, x)) \right) \quad \text{[masked / causal self-attention]}$$

$$x = \text{LayerNorm}\!\left( x + \text{Dropout}(\text{CrossAttn}(x,\, m,\, m)) \right) \quad \text{[cross-attention to encoder memory } m \text{]}$$

$$x = \text{LayerNorm}\!\left( x + \text{Dropout}(\text{FFN}(x)) \right)$$

Pre-LN (norm_first=True):

$$x = x + \text{Dropout}(\text{SelfAttn}(\text{LayerNorm}(x)))$$

$$x = x + \text{Dropout}(\text{CrossAttn}(\text{LayerNorm}(x),\, m,\, m))$$

$$x = x + \text{Dropout}(\text{FFN}(\text{LayerNorm}(x)))$$

where $m$ is the memory tensor produced by the encoder.
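The two orderings differ only in where LayerNorm sits relative to the residual addition. The following is a minimal pure-Python sketch of that wiring, not lucid's implementation: dropout is omitted (as in evaluation mode), LayerNorm has no learned scale or shift, and `layer_norm`, `post_ln_step`, `pre_ln_step`, and `halve` are hypothetical helper names used only for illustration.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalise a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(a, b):
    """Element-wise sum, used for the residual connection."""
    return [u + v for u, v in zip(a, b)]

def post_ln_step(x, sublayer):
    # Post-LN: normalise after the residual addition.
    return layer_norm(add(x, sublayer(x)))

def pre_ln_step(x, sublayer):
    # Pre-LN: normalise only the sub-layer input; the residual path is untouched.
    return add(x, sublayer(layer_norm(x)))

def halve(v):
    # Toy stand-in for SelfAttn, CrossAttn (closing over memory m), or FFN.
    return [0.5 * t for t in v]

x = [1.0, 2.0, 3.0, 4.0]
post = post_ln_step(x, halve)  # output is normalised
pre = pre_ln_step(x, halve)    # output keeps the scale of x
```

In the Pre-LN variant the un-normalised `x` flows straight through the residual path, which is why the output of a Pre-LN layer is not itself normalised.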

Feed-forward network (FFN):

$$\text{FFN}(x) = \text{Linear}_2\!\left( \text{Dropout}(\sigma(\text{Linear}_1(x))) \right)$$
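As an illustration of the FFN formula, here is a small pure-Python sketch for a single vector with ReLU as the activation. It assumes the usual inverted-dropout convention (survivors rescaled by 1/(1-p)); the helper names and the toy weights are hypothetical and are not taken from lucid.

```python
import random

def linear(x, weight, bias):
    """y = W x + b for one vector x; weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def dropout(x, p, training, rng):
    """Assumed inverted dropout: zero with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(x)
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in x]

def ffn(x, w1, b1, w2, b2, p=0.1, training=False, rng=None):
    rng = rng or random.Random(0)
    # Linear_1 -> activation -> Dropout -> Linear_2, as in the formula above.
    hidden = dropout(relu(linear(x, w1, b1)), p, training, rng)
    return linear(hidden, w2, b2)

# Toy sizes: d_model = 2, dim_feedforward = 3.
w1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 1.0, 1.0], [1.0, -1.0, 0.0]]
b2 = [0.0, 0.5]
y = ffn([2.0, -3.0], w1, b1, w2, b2)  # evaluation mode, so dropout is a no-op
print(y)  # [3.0, 2.5]
```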

Parameters

d_model: int

Dimensionality of the model's hidden representations.

nhead: int

Number of attention heads in both the self-attention and the cross-attention sub-layers.

dim_feedforward: int = 2048

Inner width of the FFN. Default: 2048.

dropout: float = 0.1

Dropout probability for all sub-layer outputs and the FFN activation. Default: 0.1.

activation: str = 'relu'

Non-linearity inside the FFN: "relu" (default) or "gelu".

batch_first: bool = False

If True, all tensors use (batch, seq, feature) layout. Default: False (sequence-first).

norm_first: bool = False

If True, uses Pre-LN order. Default: False (Post-LN).

device: DeviceLike = None

Device for all sub-module parameters.

dtype: DTypeLike = None

Data type for all sub-module parameters.

Attributes

self_attn: MultiheadAttention

Masked (causal) self-attention over the target sequence.

multihead_attn: MultiheadAttention

Cross-attention: queries from target, keys/values from memory.

linear1: Linear

First FFN layer: d_model → dim_feedforward.

linear2: Linear

Second FFN layer: dim_feedforward → d_model.

norm1: LayerNorm

Normalisation for the self-attention sub-layer.

norm2: LayerNorm

Normalisation for the cross-attention sub-layer.

norm3: LayerNorm

Normalisation for the FFN sub-layer.

dropout1: Dropout

Dropout after the self-attention output.

dropout2: Dropout

Dropout inside the FFN (after activation).

dropout3: Dropout

Dropout after the cross-attention output.

dropout4: Dropout

Dropout after the FFN output.
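To get a feel for the size of these sub-modules, the sketch below estimates the number of learnable parameters. It rests on assumptions that this page does not confirm: each MultiheadAttention holds query, key, value, and output projections of size d_model × d_model with biases, both Linear layers have biases, and each LayerNorm has a scale and a shift vector. `decoder_layer_param_count` is a hypothetical helper, not a lucid function.

```python
def decoder_layer_param_count(d_model, dim_feedforward=2048):
    """Estimate learnable parameters under the assumptions stated above."""
    # Assumed: Q, K, V and output projections, each d_model x d_model plus bias.
    attn = 4 * (d_model * d_model + d_model)
    # linear1: d_model -> dim_feedforward, linear2: dim_feedforward -> d_model.
    ffn = (d_model * dim_feedforward + dim_feedforward) + (
        dim_feedforward * d_model + d_model
    )
    # Assumed: each LayerNorm has a weight and a bias of length d_model.
    norm = 2 * d_model
    # Two attention blocks (self and cross), one FFN, three LayerNorms.
    return 2 * attn + ffn + 3 * norm

count = decoder_layer_param_count(512)
print(count)  # 4204032
```

Under these assumptions the two attention blocks and the FFN contribute roughly equal shares when dim_feedforward is four times d_model.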

Notes

tgt: $(T, N, E)$ when batch_first=False, or $(N, T, E)$ when batch_first=True.
memory: $(S, N, E)$ when batch_first=False, or $(N, S, E)$ when batch_first=True.
Output: same shape as tgt.

where $T$ is the target sequence length, $S$ is the source sequence length, $N$ is the batch size, and $E$ = d_model.
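The shape conventions above can be written down as a small lookup. `decoder_layer_shapes` is a hypothetical helper for illustration only; it does not call lucid.

```python
def decoder_layer_shapes(T, S, N, E, batch_first=False):
    """Return the expected tgt, memory, and output shapes for one layer call."""
    if batch_first:
        tgt, memory = (N, T, E), (N, S, E)
    else:
        tgt, memory = (T, N, E), (S, N, E)
    # The output always has the same shape as tgt.
    return {"tgt": tgt, "memory": memory, "output": tgt}

seq_first = decoder_layer_shapes(T=10, S=20, N=2, E=512)
batch_major = decoder_layer_shapes(T=8, S=12, N=2, E=256, batch_first=True)
```

Note that the source length S never appears in the output shape: memory only supplies keys and values.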

The self-attention sub-layer is typically run with tgt_is_causal=True so that position $i$ cannot attend to any future position $j > i$. During teacher-forced training, where the whole target sequence is fed in at once, this prevents information from leaking in from future tokens.
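The causal constraint can be pictured as a $T \times T$ boolean matrix. The sketch below assumes the convention that True marks a blocked position; whether lucid's masks use this convention or an additive one is not stated on this page, and `causal_mask` is a hypothetical helper.

```python
def causal_mask(T):
    """mask[i][j] is True where query position i must not attend to key j."""
    return [[j > i for j in range(T)] for i in range(T)]

mask = causal_mask(3)
for row in mask:
    print(row)
# [False, True, True]
# [False, False, True]
# [False, False, False]
```

Row $i$ blocks every column $j > i$, so the first position sees only itself and the last position sees the whole prefix.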

The cross-attention sub-layer uses the encoder output memory as both keys and values, while the decoder's current hidden state provides the queries. This is the mechanism by which the decoder conditions its generation on the source sequence.
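A stripped-down view of this query/key/value split is sketched below as single-head scaled dot-product attention in pure Python. The learned projections, multiple heads, masking, and dropout of a real MultiheadAttention are deliberately left out, and `softmax` and `cross_attention` are hypothetical helper names.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Each output row is a weighted average of the value rows.
        out.append(
            [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]
        )
    return out

tgt_states = [[1.0, 0.0], [0.0, 1.0]]          # T = 2 decoder positions (queries)
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # S = 3 encoder positions
context = cross_attention(tgt_states, memory, memory)  # keys and values from memory
```

The result has one row per target position (T rows), each a convex combination of the S memory rows, which is how source information enters the decoder state.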

Examples

**Basic decoder layer** (sequence-first layout):
>>> import lucid
>>> import lucid.nn as nn
>>> layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
>>> tgt = lucid.randn(10, 2, 512)       # (tgt_len, batch, d_model)
>>> memory = lucid.randn(20, 2, 512)    # (src_len, batch, d_model)
>>> out = layer(tgt, memory)
>>> out.shape
(10, 2, 512)

**Pre-LN decoder layer with causal self-attention**:
>>> layer = nn.TransformerDecoderLayer(
...     d_model=256, nhead=4, activation="gelu",
...     batch_first=True, norm_first=True,
... )
>>> tgt = lucid.randn(2, 8, 256)
>>> memory = lucid.randn(2, 12, 256)
>>> out = layer(tgt, memory, tgt_is_causal=True)
>>> out.shape
(2, 8, 256)

Constructors

__init__

→None

__init__(d_model: int, nhead: int, dim_feedforward: int = 2048, dropout: float = 0.1, activation: str = 'relu', batch_first: bool = False, norm_first: bool = False, device: DeviceLike = None, dtype: DTypeLike = None)

Initialise the TransformerDecoderLayer module. See the class docstring for parameter semantics.

Instance methods

extra_repr

→str

extra_repr()

Return a string representation of the layer's configuration.

forward

→Tensor

forward(tgt: Tensor, memory: Tensor, tgt_mask: Tensor | None = None, memory_mask: Tensor | None = None, tgt_key_padding_mask: Tensor | None = None, memory_key_padding_mask: Tensor | None = None, tgt_is_causal: bool = False, memory_is_causal: bool = False, past_key_value: Cache | None = None, layer_idx: int = 0, use_cache: bool = False, cache_position: Tensor | None = None)

Run the forward pass of the module.

Parameters

tgt: Tensor

Target sequence. See the Notes in the class docstring for the expected shape.

memory: Tensor

Encoder output attended to by the cross-attention sub-layer. See the Notes in the class docstring for the expected shape.

tgt_mask: Tensor | None = None

Optional attention mask for the self-attention sub-layer.

memory_mask: Tensor | None = None

Optional attention mask for the cross-attention sub-layer.

tgt_key_padding_mask: Tensor | None = None

Optional mask marking padded positions in tgt.

memory_key_padding_mask: Tensor | None = None

Optional mask marking padded positions in memory.

tgt_is_causal: bool = False

Whether the self-attention sub-layer is causal. See the Notes in the class docstring.

memory_is_causal: bool = False

Whether the cross-attention sub-layer is causal.

past_key_value: Cache | None = None (optional, keyword-only)

Encoder-decoder cache; self-attention grows it, cross-attention fills it once from memory.

layer_idx: int = 0 (optional, keyword-only)

Index of this layer within the cache.

use_cache: bool = False (optional, keyword-only)

Enable incremental KV caching.

cache_position: Tensor | None = None (optional, keyword-only)

Absolute positions of the new tgt tokens; accepted for parity.

Returns

Tensor

Output tensor; refer to the class docstring for the exact shape.
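The caching behaviour described for past_key_value (self-attention entries grow with each decoded token, cross-attention entries are filled once from memory) can be sketched with a toy structure. This is not lucid's Cache class: `ToyLayerCache` and `decode_step` are hypothetical, and the states are stored as-is rather than projected into real keys and values.

```python
class ToyLayerCache:
    """Hypothetical stand-in for one layer's slot in an encoder-decoder cache."""

    def __init__(self):
        self.self_keys = []       # grows by one entry per decoded token
        self.self_values = []
        self.cross_keys = None    # filled once from memory, then reused
        self.cross_values = None

def decode_step(new_token_state, memory, cache):
    """Update the cache for one new target token and report the cache sizes."""
    # Self-attention: append the entry for the new token only.
    cache.self_keys.append(new_token_state)
    cache.self_values.append(new_token_state)
    # Cross-attention: memory is processed on the first step and never again.
    if cache.cross_keys is None:
        cache.cross_keys = list(memory)
        cache.cross_values = list(memory)
    return len(cache.self_keys), len(cache.cross_keys)

memory = ["m0", "m1", "m2"]
cache = ToyLayerCache()
history = [decode_step(f"t{i}", memory, cache) for i in range(4)]
print(history)  # [(1, 3), (2, 3), (3, 3), (4, 3)]
```

The first number in each pair grows with the decoded length while the second stays at the source length, which is the saving incremental decoding is meant to provide.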
