TransformerDecoderLayer
Module
TransformerDecoderLayer(d_model: int, nhead: int, dim_feedforward: int = 2048, dropout: float = 0.1, activation: str = 'relu', batch_first: bool = False, norm_first: bool = False, device: DeviceLike = None, dtype: DTypeLike = None)
Single transformer decoder layer: masked self-attention, cross-attention over encoder memory, and a position-wise FFN, each with a residual connection and layer normalisation.
The decoder layer introduces a second attention sub-layer that
attends to the encoder output (memory), connecting the
generation process to the source context. The three sub-layers
and their residual connections are:
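Post-LN (default, norm_first=False):

$$
\begin{aligned}
x &\leftarrow \mathrm{LayerNorm}\big(x + \mathrm{SelfAttn}(x)\big) \\
x &\leftarrow \mathrm{LayerNorm}\big(x + \mathrm{CrossAttn}(x, m)\big) \\
x &\leftarrow \mathrm{LayerNorm}\big(x + \mathrm{FFN}(x)\big)
\end{aligned}
$$

Pre-LN (norm_first=True):

$$
\begin{aligned}
x &\leftarrow x + \mathrm{SelfAttn}\big(\mathrm{LayerNorm}(x)\big) \\
x &\leftarrow x + \mathrm{CrossAttn}\big(\mathrm{LayerNorm}(x), m\big) \\
x &\leftarrow x + \mathrm{FFN}\big(\mathrm{LayerNorm}(x)\big)
\end{aligned}
$$

where $m$ is the memory tensor produced by the encoder.

Feed-forward network (FFN):

$$
\mathrm{FFN}(x) = W_2\,\sigma(W_1 x + b_1) + b_2
$$

with $\sigma$ the chosen activation (dropout omitted for brevity).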
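The difference between the two residual orderings named above can be sketched in plain Python. This is an illustrative sketch, not the library's implementation: `decoder_layer_sketch` is a hypothetical helper, the sub-layers and norms are passed in as arbitrary callables, dropout is left out, and the pairing of `norm1`/`norm2`/`norm3` with self-attention, cross-attention, and the FFN is an assumption. The toy scalar functions at the bottom are stand-ins chosen only to make the ordering visible.

```python
def decoder_layer_sketch(x, memory, self_attn, cross_attn, ffn, norms,
                         norm_first=False):
    """Apply the three sub-layers in Post-LN or Pre-LN order (no dropout)."""
    norm1, norm2, norm3 = norms  # assumed pairing: self-attn, cross-attn, FFN
    if norm_first:
        # Pre-LN: normalise the sub-layer input, then add the residual.
        x = x + self_attn(norm1(x))
        x = x + cross_attn(norm2(x), memory)
        x = x + ffn(norm3(x))
    else:
        # Post-LN: add the residual first, then normalise the sum.
        x = norm1(x + self_attn(x))
        x = norm2(x + cross_attn(x, memory))
        x = norm3(x + ffn(x))
    return x


# Toy scalar stand-ins (hypothetical, for illustration only).
def halve(v):
    return v / 2          # plays the role of a normalisation


def toy_self_attn(v):
    return v + 1


def toy_cross_attn(v, m):
    return v + m          # depends on the encoder memory


def toy_ffn(v):
    return 2 * v


norms = (halve, halve, halve)
post_ln = decoder_layer_sketch(2.0, 4.0, toy_self_attn, toy_cross_attn,
                               toy_ffn, norms, norm_first=False)
pre_ln = decoder_layer_sketch(2.0, 4.0, toy_self_attn, toy_cross_attn,
                              toy_ffn, norms, norm_first=True)
print(post_ln, pre_ln)  # 6.75 20.0
```

In the Pre-LN branch the residual path is never normalised, which is why the two orderings produce different values even with identical sub-layers.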
Parameters
- d_model (int)
- nhead (int)
- dim_feedforward (int, default 2048)
- dropout (float, default 0.1)
- activation (str, default 'relu'): "relu" (default) or "gelu".
- batch_first (bool, default False): If True, all tensors use (batch, seq, feature) layout. Default: False (sequence-first).
- norm_first (bool, default False): If True, uses Pre-LN order. Default: False (Post-LN).
- device (DeviceLike, default None)
- dtype (DTypeLike, default None)

Attributes
- self_attn (MultiheadAttention)
- multihead_attn (MultiheadAttention)
- linear1 (Linear): d_model → dim_feedforward.
- linear2 (Linear): dim_feedforward → d_model.
- norm1 (LayerNorm)
- norm2 (LayerNorm)
- norm3 (LayerNorm)
- dropout1 (Dropout)
- dropout2 (Dropout)
- dropout3 (Dropout)
- dropout4 (Dropout)

Notes
- tgt: $(T, N, E)$ when batch_first=False, or $(N, T, E)$ when batch_first=True.
- memory: $(S, N, E)$ when batch_first=False, or $(N, S, E)$ when batch_first=True.
- Output: same shape as tgt.

where $T$ is the target sequence length, $S$ is the
source sequence length, $N$ is the batch size, and
$E$ = d_model.
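As a quick check of the layout rules above, the expected shapes can be written out with a small helper. `expected_shapes` is a hypothetical function introduced only for illustration; it is not part of the library and simply restates the shape convention in code.

```python
def expected_shapes(tgt_len, src_len, batch, d_model, batch_first=False):
    """Return the tgt, memory, and output shapes for a given layout."""
    if batch_first:
        # (batch, seq, feature)
        tgt = (batch, tgt_len, d_model)
        memory = (batch, src_len, d_model)
    else:
        # (seq, batch, feature), the default sequence-first layout
        tgt = (tgt_len, batch, d_model)
        memory = (src_len, batch, d_model)
    # The layer output always has the same shape as tgt.
    return {"tgt": tgt, "memory": memory, "output": tgt}


print(expected_shapes(10, 20, 2, 512))
print(expected_shapes(8, 12, 2, 256, batch_first=True))
```

Only the sequence axis differs between tgt and memory; the batch size and feature dimension must match.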
The self-attention sub-layer is typically used with
tgt_is_causal=True during autoregressive decoding so that
position $i$ cannot attend to any future position
$j > i$. This prevents information leakage from future
tokens during teacher-forced training.
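The causal constraint described above corresponds to a strictly upper-triangular pattern of blocked positions. The sketch below builds that pattern with plain Python lists. `causal_mask` is a hypothetical helper, and the "True means blocked" convention is an assumption made for illustration; the mask dtype and convention the library expects for tgt_mask are not stated here.

```python
def causal_mask(tgt_len):
    """mask[i][j] is True where query position i must NOT see key position j."""
    return [[j > i for j in range(tgt_len)] for i in range(tgt_len)]


mask = causal_mask(4)
for row in mask:
    # 'x' marks a blocked (future) position, '.' an allowed one.
    print("".join("x" if blocked else "." for blocked in row))
# .xxx
# ..xx
# ...x
# ....
```

Each row allows the current position and everything before it, so the first target token sees only itself and the last one sees the whole prefix.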
The cross-attention sub-layer uses the encoder output memory
as both keys and values, while the decoder's current hidden state
provides the queries. This is the mechanism by which the decoder
conditions its generation on the source sequence.
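To make the query/key/value roles concrete, here is a minimal single-head sketch in pure Python. It assumes scaled dot-product attention with no learned projections, no masking, and no multi-head split, so it illustrates the data flow only; `cross_attention` is a hypothetical helper, not the library's multihead_attn.

```python
import math


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without projections.

    queries: tgt_len vectors taken from the decoder hidden state.
    keys, values: src_len vectors, both taken from the encoder memory.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    output = []
    for q in queries:
        # One score per source position.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted average of the value vectors.
        output.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return output


decoder_state = [[1.0, 0.0], [0.0, 1.0]]        # tgt_len = 2
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # src_len = 3
out = cross_attention(decoder_state, memory, memory)
print(len(out), len(out[0]))  # 2 2
```

The output has one row per target position regardless of the source length, which matches the rule that the layer output has the same shape as tgt.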
Examples
**Basic decoder layer** (sequence-first layout):
>>> import lucid
>>> import lucid.nn as nn
>>> layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
>>> tgt = lucid.randn(10, 2, 512) # (tgt_len, batch, d_model)
>>> memory = lucid.randn(20, 2, 512) # (src_len, batch, d_model)
>>> out = layer(tgt, memory)
>>> out.shape
(10, 2, 512)
**Pre-LN decoder layer with causal self-attention**:
>>> layer = nn.TransformerDecoderLayer(
... d_model=256, nhead=4, activation="gelu",
... batch_first=True, norm_first=True,
... )
>>> tgt = lucid.randn(2, 8, 256)
>>> memory = lucid.randn(2, 12, 256)
>>> out = layer(tgt, memory, tgt_is_causal=True)
>>> out.shape
(2, 8, 256)

Methods (3)
__init__
__init__(d_model: int, nhead: int, dim_feedforward: int = 2048, dropout: float = 0.1, activation: str = 'relu', batch_first: bool = False, norm_first: bool = False, device: DeviceLike = None, dtype: DTypeLike = None) → None
Initialise the TransformerDecoderLayer module. See the class docstring for parameter semantics.
forward
forward(tgt: Tensor, memory: Tensor, tgt_mask: Tensor | None = None, memory_mask: Tensor | None = None, tgt_key_padding_mask: Tensor | None = None, memory_key_padding_mask: Tensor | None = None, tgt_is_causal: bool = False, memory_is_causal: bool = False) → Tensor
Run the forward pass of the module.
Parameters
- tgt (Tensor)
- memory (Tensor)
- tgt_mask (Tensor | None, default None)
- memory_mask (Tensor | None, default None)
- tgt_key_padding_mask (Tensor | None, default None)
- memory_key_padding_mask (Tensor | None, default None)
- tgt_is_causal (bool, default False)
- memory_is_causal (bool, default False)

Returns
- Tensor: Output tensor; refer to the class docstring for the exact shape.
extra_repr
extra_repr() → str
Return a string representation of the layer's configuration.