Transformer
Module
Transformer(d_model: int = 512, nhead: int = 8, num_encoder_layers: int = 6, num_decoder_layers: int = 6, dim_feedforward: int = 2048, dropout: float = 0.1, activation: str = 'relu', batch_first: bool = False, device: DeviceLike = None, dtype: DTypeLike = None)
Full encoder-decoder Transformer architecture.
Implements the complete sequence-to-sequence model introduced in "Attention Is All You Need" (Vaswani et al., 2017, arXiv:1706.03762). The model consists of:
- A TransformerEncoder that encodes the source sequence into a continuous memory representation.
- A TransformerDecoder that generates the target sequence conditioned on the encoder memory.
High-level data flow: the encoder maps src to memory, and the decoder maps tgt together with memory to the output.
Both the encoder and decoder are built with a final
LayerNorm applied after the last layer.
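Parameters
d_model: int = 512
    Feature dimension of the model inputs and outputs. Default: 512.
nhead: int = 8
    Number of attention heads. Default: 8.
num_encoder_layers: int = 6
    Number of stacked encoder layers. Default: 6.
num_decoder_layers: int = 6
    Number of stacked decoder layers. Default: 6.
dim_feedforward: int = 2048
    Hidden dimension of the feedforward sublayers. Default: 2048.
dropout: float = 0.1
    Dropout probability. Default: 0.1.
activation: str = 'relu'
    Feedforward activation: "relu" (default) or "gelu".
batch_first: bool = False
    If True, all input and output tensors use (batch, seq, feature) layout. Default: False (sequence-first).
device: DeviceLike = None
dtype: DTypeLike = None
Attributes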
encoder: TransformerEncoder
    Encoder stack (num_encoder_layers stacked encoder layers + final LayerNorm).
decoder: TransformerDecoder
    Decoder stack (num_decoder_layers stacked decoder layers + final LayerNorm).
d_model: int
nhead: int
Notes
When batch_first=False (default):
- src: (S, N, E)
- tgt: (T, N, E)
- Output: (T, N, E)
When batch_first=True:
- src: (N, S, E)
- tgt: (N, T, E)
- Output: (N, T, E)
where S is the source length, T is the target length, N is the batch size, and E = d_model.
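The layout rule can be checked with a small helper before building tensors. This is only a sketch: `expected_shapes` is a hypothetical function, not part of the library, and it assumes (as the examples on this page show) that the output always has the same shape as tgt.

```python
def expected_shapes(src_len, tgt_len, batch, d_model, batch_first=False):
    """Return the src, tgt and output shapes for the chosen layout.

    Hypothetical helper for illustration; it only encodes the shape
    convention described in the notes and does not touch the library.
    """
    if batch_first:
        src = (batch, src_len, d_model)
        tgt = (batch, tgt_len, d_model)
    else:
        src = (src_len, batch, d_model)
        tgt = (tgt_len, batch, d_model)
    # The output follows the target: same length, batch size and feature size.
    return {"src": src, "tgt": tgt, "out": tgt}


print(expected_shapes(10, 7, 2, 512))
print(expected_shapes(20, 15, 2, 256, batch_first=True))
```

The two calls mirror the shapes used in the Examples section of this page.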
Positional encoding is not included in this module.
The caller is responsible for adding positional information to
src and tgt before passing them in. Common choices are
sinusoidal encodings (Vaswani et al.) or learned absolute /
rotary position embeddings.
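As an illustration of the sinusoidal option, the table below follows the formula from Vaswani et al.: even feature indices use a sine, odd indices a cosine, with wavelengths growing geometrically up to 10000. It is a minimal pure-Python sketch; `sinusoidal_encoding` is a hypothetical helper, and converting the table to a tensor and adding it to the embeddings is left to the caller.

```python
import math


def sinusoidal_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings.

    Hypothetical helper for illustration, written with plain lists so it
    runs without any tensor library.
    """
    table = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        # i walks over the even feature indices, so i / d_model equals
        # the exponent 2k / d_model from the paper.
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            table[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                table[pos][i + 1] = math.cos(angle)
    return table


pe = sinusoidal_encoding(seq_len=4, d_model=8)
```

Row pos of the table would be added to the embedding at position pos of src (and likewise for tgt) before calling the model.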
Masking conventions:
- src_mask / tgt_mask / memory_mask: attention bias masks of shape (S, S), (T, T), or (T, S) respectively (or (N*H, ...) per-head variants). Boolean True = ignore that position.
- *_key_padding_mask: boolean masks of shape (N, S) or (N, T) indicating padding positions (True = pad).
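The two mask kinds can be sketched with nested lists of booleans. Both helpers below are hypothetical and assume that rows index query positions and columns index key positions. Turning the lists into boolean tensors is not shown, since the conversion call is not documented on this page.

```python
def causal_mask(size):
    """(T, T) boolean mask; True above the diagonal hides later positions."""
    # Row = query position, column = key position (assumed orientation).
    return [[col > row for col in range(size)] for row in range(size)]


def key_padding_mask(lengths, max_len):
    """(N, max_len) boolean mask; True marks padding past each true length."""
    return [[pos >= length for pos in range(max_len)] for length in lengths]


tgt_mask = causal_mask(3)
src_padding = key_padding_mask([3, 1], max_len=3)
```

With True meaning "ignore", the first target position can attend only to itself, and the second sequence in the batch exposes only its first token to attention.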
Initialisation: All projection weights use Xavier uniform
initialisation (via MultiheadAttention) and all biases
are zeroed. This matches the original paper's setup and provides
stable gradient flow at the start of training.
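For reference, Xavier (Glorot) uniform initialisation draws each weight from U(-a, a) with a = gain * sqrt(6 / (fan_in + fan_out)). The sketch below reproduces that rule in pure Python. `xavier_uniform` here is a hypothetical stand-in, not the library's own initialiser, and the seed argument exists only to make the example repeatable.

```python
import math
import random


def xavier_uniform(fan_out, fan_in, gain=1.0, seed=0):
    """Sample a fan_out x fan_in matrix from U(-bound, bound).

    Hypothetical helper for illustration; returns the matrix as nested
    lists together with the bound that was used.
    """
    bound = gain * math.sqrt(6.0 / (fan_in + fan_out))
    rng = random.Random(seed)
    weights = [
        [rng.uniform(-bound, bound) for _ in range(fan_in)]
        for _ in range(fan_out)
    ]
    return weights, bound


weights, bound = xavier_uniform(fan_out=4, fan_in=6)
```

The bound shrinks as the layer gets wider, which keeps the variance of activations and gradients roughly constant across layers at the start of training.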
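Inference / generation: For autoregressive decoding, invoke
the encoder once to obtain memory, then call the decoder
iteratively with a growing tgt tensor and an appropriate
causal mask. The decoder output at the last position is the
representation used to predict the next token; because the output
has feature size d_model, a separate output projection (not part of
this module) is needed to turn it into vocabulary logits.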
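The control flow of that loop can be sketched independently of any tensor library. In the sketch below, `greedy_decode`, `encode` and `decode` are hypothetical names. In a real setup, `encode` would wrap the encoder, and `decode` would wrap embedding, positional encoding, the decoder with a causal mask, an output projection, and an argmax over the last position. Toy stand-ins are used so the example runs as written.

```python
def greedy_decode(encode, decode, src, bos_id, eos_id, max_len):
    """Greedy autoregressive decoding with a single encoder pass.

    encode(src) -> memory
    decode(tgt_ids, memory) -> id of the next token, read at the last position
    Both callables are assumptions supplied by the caller.
    """
    memory = encode(src)      # the encoder runs exactly once
    tgt_ids = [bos_id]        # the target grows by one token per step
    for _ in range(max_len - 1):
        next_id = decode(tgt_ids, memory)
        tgt_ids.append(next_id)
        if next_id == eos_id:
            break
    return tgt_ids


# Toy stand-ins: the "memory" is the source itself and the "decoder"
# copies it token by token, then emits 0 as the end-of-sequence id.
def toy_encode(src):
    return list(src)


def toy_decode(tgt_ids, memory):
    step = len(tgt_ids) - 1   # number of tokens generated so far
    return memory[step] if step < len(memory) else 0


result = greedy_decode(toy_encode, toy_decode, [5, 6, 7], bos_id=1, eos_id=0, max_len=10)
print(result)  # [1, 5, 6, 7, 0]
```

Re-running the decoder on the whole growing target each step is simple but quadratic in the target length; whether the library offers any caching to avoid this is not stated on this page.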
Examples
**Standard translation model** (default hyperparameters):
>>> import lucid
>>> import lucid.nn as nn
>>> model = nn.Transformer(d_model=512, nhead=8,
... num_encoder_layers=6,
... num_decoder_layers=6)
>>> src = lucid.randn(10, 2, 512) # (src_len, batch, d_model)
>>> tgt = lucid.randn(7, 2, 512) # (tgt_len, batch, d_model)
>>> out = model(src, tgt)
>>> out.shape
(7, 2, 512)
**Compact model with batch_first and GELU**:
>>> model = nn.Transformer(
... d_model=256, nhead=4,
... num_encoder_layers=3, num_decoder_layers=3,
... dim_feedforward=512, dropout=0.0,
... activation="gelu", batch_first=True,
... )
>>> src = lucid.randn(2, 20, 256) # (batch, src_len, d_model)
>>> tgt = lucid.randn(2, 15, 256) # (batch, tgt_len, d_model)
>>> out = model(src, tgt)
>>> out.shape
(2, 15, 256)
Methods (3)
__init__
__init__(d_model: int = 512, nhead: int = 8, num_encoder_layers: int = 6, num_decoder_layers: int = 6, dim_feedforward: int = 2048, dropout: float = 0.1, activation: str = 'relu', batch_first: bool = False, device: DeviceLike = None, dtype: DTypeLike = None) → None
Initialise the Transformer module. See the class docstring for parameter semantics.
forward
forward(src: Tensor, tgt: Tensor, src_mask: Tensor | None = None, tgt_mask: Tensor | None = None, memory_mask: Tensor | None = None, src_key_padding_mask: Tensor | None = None, tgt_key_padding_mask: Tensor | None = None, memory_key_padding_mask: Tensor | None = None) → Tensor
Run the forward pass of the module.
Parameters
src: Tensor
tgt: Tensor
src_mask: Tensor = None
tgt_mask: Tensor = None
memory_mask: Tensor = None
src_key_padding_mask: Tensor = None
tgt_key_padding_mask: Tensor = None
memory_key_padding_mask: Tensor = None
Returns
Tensor
    Output tensor; refer to the class docstring for the exact shape.
extra_repr
extra_repr() → str
Return a string representation of the layer's configuration.