class

Transformer

extends Module
Transformer(d_model: int = 512, nhead: int = 8, num_encoder_layers: int = 6, num_decoder_layers: int = 6, dim_feedforward: int = 2048, dropout: float = 0.1, activation: str = 'relu', batch_first: bool = False, device: DeviceLike = None, dtype: DTypeLike = None)

Full encoder-decoder Transformer architecture.

Implements the complete sequence-to-sequence model introduced in "Attention Is All You Need" (Vaswani et al., 2017, arXiv:1706.03762). The model consists of:

  1. A TransformerEncoder that encodes the source sequence into a continuous memory representation.
  2. A TransformerDecoder that generates the target sequence conditioned on the encoder memory.

High-level data flow:

$$\text{memory} = \text{Encoder}(\text{src},\; \text{src\_mask},\; \text{src\_key\_padding\_mask})$$

$$\text{output} = \text{Decoder}(\text{tgt},\; \text{memory},\; \text{tgt\_mask},\; \text{memory\_mask},\; \text{tgt\_key\_padding\_mask},\; \text{memory\_key\_padding\_mask})$$

Both the encoder and decoder are built with a final LayerNorm applied after the last layer.
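
The routing of inputs and masks in the data flow above can be sketched in plain Python. The `encoder` and `decoder` arguments, and the `toy_encoder` / `toy_decoder` helpers, are hypothetical stand-ins for illustration, not lucid objects; only the order in which arguments are passed mirrors the equations.

```python
def transformer_forward(encoder, decoder, src, tgt, *,
                        src_mask=None, tgt_mask=None, memory_mask=None,
                        src_key_padding_mask=None,
                        tgt_key_padding_mask=None,
                        memory_key_padding_mask=None):
    """Route inputs and masks as in the two data-flow equations."""
    # Stage 1: the encoder sees only the source and its two masks.
    memory = encoder(src, src_mask, src_key_padding_mask)
    # Stage 2: the decoder sees the target, the memory, and the other four masks.
    return decoder(tgt, memory, tgt_mask, memory_mask,
                   tgt_key_padding_mask, memory_key_padding_mask)


# Hypothetical stand-ins that only record what they receive.
def toy_encoder(src, mask, key_padding_mask):
    return ("memory", src)


def toy_decoder(tgt, memory, *masks):
    return ("output", tgt, memory)


result = transformer_forward(toy_encoder, toy_decoder, "src", "tgt")
```

The source masks never reach the decoder directly; padding in the source is communicated to cross-attention through memory_key_padding_mask.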

Parameters

d_model: int = 512
Dimensionality of all model representations, $d_{\text{model}}$. Default: 512.
nhead: int = 8
Number of attention heads in every multi-head attention sub-layer (encoder self-attention, decoder self-attention, and decoder cross-attention all use the same nhead). Default: 8.
num_encoder_layers: int = 6
Number of layers in the encoder stack. Default: 6.
num_decoder_layers: int = 6
Number of layers in the decoder stack. Default: 6.
dim_feedforward: int = 2048
Inner dimension of the position-wise FFN in each layer. Default: 2048.
dropout: float = 0.1
Dropout probability applied throughout the model. Default: 0.1.
activation: str = 'relu'
Activation function for all FFN layers: "relu" (default) or "gelu".
batch_first: bool = False
If True, all input and output tensors use (batch, seq, feature) layout. Default: False (sequence-first).
device: DeviceLike = None
Device for all parameters.
dtype: DTypeLike = None
Data type for all parameters.
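
As a small arithmetic illustration of how d_model and nhead interact: multi-head attention conventionally splits the model dimension evenly across heads. The sketch below assumes that convention; whether this module validates divisibility at construction is not stated here, and `head_dim` is a hypothetical helper, not part of the API.

```python
def head_dim(d_model: int, nhead: int) -> int:
    """Per-head width, assuming d_model must divide evenly across heads."""
    if d_model % nhead != 0:
        raise ValueError(f"d_model={d_model} is not divisible by nhead={nhead}")
    return d_model // nhead


default_head_dim = head_dim(512, 8)   # default configuration
compact_head_dim = head_dim(256, 4)   # a smaller configuration
```

Both configurations give 64-dimensional heads.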

Attributes

encoder: TransformerEncoder
The encoder module (num_encoder_layers stacked encoder layers + final LayerNorm).
decoder: TransformerDecoder
The decoder module (num_decoder_layers stacked decoder layers + final LayerNorm).
d_model: int
Model dimension used at construction.
nhead: int
Number of attention heads used at construction.

Notes

When batch_first=False (default):

  • src: $(S, N, E)$
  • tgt: $(T, N, E)$
  • Output: $(T, N, E)$

When batch_first=True:

  • src: $(N, S, E)$
  • tgt: $(N, T, E)$
  • Output: $(N, T, E)$

where $S$ is the source length, $T$ is the target length, $N$ is the batch size, and $E$ = d_model.
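
The two layouts can be summarised with a small helper. `expected_shapes` is a hypothetical function written for illustration only; it encodes the shape rules listed above and does not call the library.

```python
def expected_shapes(S: int, T: int, N: int, E: int, batch_first: bool = False):
    """Return the (src, tgt, output) shapes for the given layout."""
    if batch_first:
        return (N, S, E), (N, T, E), (N, T, E)
    return (S, N, E), (T, N, E), (T, N, E)


seq_major = expected_shapes(S=10, T=7, N=2, E=512)
batch_major = expected_shapes(S=20, T=15, N=2, E=256, batch_first=True)
```

In both layouts the output takes the target's sequence length, never the source's.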

Positional encoding is not included in this module. The caller is responsible for adding positional information to src and tgt before passing them in. Common choices are sinusoidal encodings (Vaswani et al.) or learned absolute position embeddings. Rotary position embeddings act on queries and keys inside attention, so they cannot be applied by adding to the inputs.
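
A minimal pure-Python sketch of the sinusoidal encoding from Vaswani et al. follows. It builds a plain nested list rather than a lucid tensor; converting the table to a tensor and adding it to the embedded src and tgt is left to the caller. `sinusoidal_encoding` is a hypothetical helper name.

```python
import math


def sinusoidal_encoding(seq_len: int, d_model: int) -> list[list[float]]:
    """Return a (seq_len, d_model) table of sinusoidal position encodings."""
    table = []
    for pos in range(seq_len):
        row = []
        for j in range(d_model):
            # Dimensions 2i and 2i+1 share the frequency 1 / 10000^(2i / d_model).
            angle = pos / (10000 ** ((j - j % 2) / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = sinusoidal_encoding(seq_len=10, d_model=8)
```

Row `pe[p]` is added to the feature vector at position p of every sequence in the batch.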

Masking conventions:

  • src_mask / tgt_mask / memory_mask: attention bias masks of shape (S, S), (T, T), or (T, S) respectively (or (N*H, ...) per-head variants). Boolean True = ignore that position.
  • *_key_padding_mask: boolean masks of shape (N, S) or (N, T) indicating padding positions (True = pad).
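
The boolean conventions above can be illustrated with nested lists. This is a sketch only: `causal_mask` and `key_padding_mask` are hypothetical helpers, and turning the lists into boolean lucid tensors is left out because the exact constructor is not covered here.

```python
def causal_mask(T: int) -> list[list[bool]]:
    """(T, T) mask: True where query position i must ignore key position j > i."""
    return [[j > i for j in range(T)] for i in range(T)]


def key_padding_mask(lengths: list[int], max_len: int) -> list[list[bool]]:
    """(N, max_len) mask: True marks padding beyond each sequence's true length."""
    return [[pos >= n for pos in range(max_len)] for n in lengths]


tgt_mask = causal_mask(3)
src_key_padding_mask = key_padding_mask([4, 2], max_len=4)
```

Here `tgt_mask` is strictly upper-triangular, and the second source sequence (length 2) has its last two positions marked as padding.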

Initialisation: All projection weights use Xavier uniform initialisation (via MultiheadAttention) and all biases are zeroed. This is a common choice in Transformer implementations and is intended to give stable gradient flow at the start of training.
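
For reference, Xavier (Glorot) uniform initialisation samples weights from U(-a, a) with a = gain * sqrt(6 / (fan_in + fan_out)). The sketch below computes that bound; which fan_in and fan_out values the library uses for each projection (for example, a packed query/key/value weight) is not stated here, so the 512 x 512 case is an assumption for illustration.

```python
import math


def xavier_uniform_bound(fan_in: int, fan_out: int, gain: float = 1.0) -> float:
    """Half-width a of the U(-a, a) range used by Xavier uniform initialisation."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))


# Assumed example: a square d_model x d_model projection with d_model = 512.
bound = xavier_uniform_bound(512, 512)
```

For this assumed shape the bound is roughly 0.0765, so initial weights are small relative to unit-scale activations.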

Inference / generation: For autoregressive decoding, invoke the encoder once to obtain memory, then call the decoder iteratively with a growing tgt tensor and an appropriate causal mask. Each decoder call returns a d_model-sized representation for every target position; the caller projects the one at the last position to vocabulary logits to choose the next token.
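
The loop described above can be sketched with greedy decoding. Everything here is a stand-in: `encode` and `decode_step` are hypothetical callables, and the token embedding, causal mask, and vocabulary projection are assumed to live inside `decode_step`. The toy functions exist only to make the control flow runnable.

```python
def greedy_decode(encode, decode_step, src, bos_id, eos_id, max_len):
    """Greedy autoregressive decoding.

    encode(src) -> memory; called exactly once.
    decode_step(tokens, memory) -> id of the next token, derived from the
    decoder output at the last target position.
    """
    memory = encode(src)
    tokens = [bos_id]
    while len(tokens) < max_len:
        next_id = decode_step(tokens, memory)  # tgt grows by one token per call
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens


# Toy stand-ins: the "memory" is the source itself and each step copies it out.
calls = {"encode": 0}


def toy_encode(src):
    calls["encode"] += 1
    return list(src)


def toy_step(tokens, memory):
    i = len(tokens) - 1
    return memory[i] if i < len(memory) else 0  # 0 plays the role of EOS


decoded = greedy_decode(toy_encode, toy_step, src=[5, 6, 7],
                        bos_id=1, eos_id=0, max_len=10)
```

Re-running the decoder over the whole growing prefix at every step is simple but quadratic in the output length; caching is a separate optimisation not covered here.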

Examples

**Standard translation model** (default hyperparameters):
>>> import lucid
>>> import lucid.nn as nn
>>> model = nn.Transformer(d_model=512, nhead=8,
...                        num_encoder_layers=6,
...                        num_decoder_layers=6)
>>> src = lucid.randn(10, 2, 512)       # (src_len, batch, d_model)
>>> tgt = lucid.randn(7, 2, 512)        # (tgt_len, batch, d_model)
>>> out = model(src, tgt)
>>> out.shape
(7, 2, 512)
**Compact model with batch_first and GELU**:
>>> model = nn.Transformer(
...     d_model=256, nhead=4,
...     num_encoder_layers=3, num_decoder_layers=3,
...     dim_feedforward=512, dropout=0.0,
...     activation="gelu", batch_first=True,
... )
>>> src = lucid.randn(2, 20, 256)
>>> tgt = lucid.randn(2, 15, 256)
>>> out = model(src, tgt)
>>> out.shape
(2, 15, 256)

Methods (3)

dunder

__init__

None
__init__(d_model: int = 512, nhead: int = 8, num_encoder_layers: int = 6, num_decoder_layers: int = 6, dim_feedforward: int = 2048, dropout: float = 0.1, activation: str = 'relu', batch_first: bool = False, device: DeviceLike = None, dtype: DTypeLike = None)

Initialise the Transformer module. See the class docstring for parameter semantics.

fn

forward

Tensor
forward(src: Tensor, tgt: Tensor, src_mask: Tensor | None = None, tgt_mask: Tensor | None = None, memory_mask: Tensor | None = None, src_key_padding_mask: Tensor | None = None, tgt_key_padding_mask: Tensor | None = None, memory_key_padding_mask: Tensor | None = None)

Run the forward pass of the module.

Parameters

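src: Tensor
Source sequence of shape (S, N, E), or (N, S, E) when batch_first=True.
tgt: Tensor
Target sequence of shape (T, N, E), or (N, T, E) when batch_first=True.
src_mask: Tensor | None = None
Attention mask for encoder self-attention, shape (S, S). Boolean True = ignore that position.
tgt_mask: Tensor | None = None
Attention mask for decoder self-attention, shape (T, T); typically a causal mask. Boolean True = ignore that position.
memory_mask: Tensor | None = None
Attention mask for decoder cross-attention over the encoder memory, shape (T, S). Boolean True = ignore that position.
src_key_padding_mask: Tensor | None = None
Boolean mask of shape (N, S) marking padding positions in src (True = pad).
tgt_key_padding_mask: Tensor | None = None
Boolean mask of shape (N, T) marking padding positions in tgt (True = pad).
memory_key_padding_mask: Tensor | None = None
Boolean mask of shape (N, S) marking padding positions in the encoder memory (True = pad).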

Returns

Tensor

Output tensor of shape (T, N, E), or (N, T, E) when batch_first=True.

fn

extra_repr

str
extra_repr()

Return a string representation of the layer's configuration.