class Transformer extends Module

Transformer(d_model: int = 512, nhead: int = 8, num_encoder_layers: int = 6, num_decoder_layers: int = 6, dim_feedforward: int = 2048, dropout: float = 0.1, activation: str = 'relu', batch_first: bool = False, device: DeviceLike = None, dtype: DTypeLike = None)

Full encoder-decoder Transformer architecture.

Implements the complete sequence-to-sequence model introduced in "Attention Is All You Need" (Vaswani et al., 2017, arXiv:1706.03762). The model consists of:

A TransformerEncoder that encodes the source sequence into a continuous memory representation.
A TransformerDecoder that generates the target sequence conditioned on the encoder memory.

High-level data flow:

$\text{memory} = \text{Encoder}(\text{src},\; \text{src\_mask},\; \text{src\_key\_padding\_mask})$

$\text{output} = \text{Decoder}(\text{tgt},\; \text{memory},\; \text{tgt\_mask},\; \text{memory\_mask},\; \text{tgt\_key\_padding\_mask},\; \text{memory\_key\_padding\_mask})$

Both the encoder and decoder are built with a final LayerNorm applied after the last layer.

Parameters

d_model: int = 512

Dimensionality of all model representations, $d_{\text{model}}$. Default: 512.

nhead: int = 8

Number of attention heads in every multi-head attention sub-layer (encoder self-attention, decoder self-attention, and decoder cross-attention all use the same nhead). Default: 8.

num_encoder_layers: int = 6

Number of layers in the encoder stack. Default: 6.

num_decoder_layers: int = 6

Number of layers in the decoder stack. Default: 6.

dim_feedforward: int = 2048

Inner dimension of the position-wise FFN in each layer. Default: 2048.

dropout: float = 0.1

Dropout probability applied throughout the model. Default: 0.1.

activation: str = 'relu'

Activation function for all FFN layers: "relu" (default) or "gelu".

batch_first: bool = False

If True, all input and output tensors use (batch, seq, feature) layout. Default: False (sequence-first).

device: DeviceLike = None

Device for all parameters.

dtype: DTypeLike = None

Data type for all parameters.

Attributes

encoder: TransformerEncoder

The encoder module (num_encoder_layers stacked encoder layers + final LayerNorm).

decoder: TransformerDecoder

The decoder module (num_decoder_layers stacked decoder layers + final LayerNorm).

d_model: int

Model dimension used at construction.

nhead: int

Number of attention heads used at construction.

Notes

When batch_first=False (default):

src: $(S, N, E)$
tgt: $(T, N, E)$
Output: $(T, N, E)$

When batch_first=True:

src: $(N, S, E)$
tgt: $(N, T, E)$
Output: $(N, T, E)$

where $S$ is the source length, $T$ is the target length, $N$ is the batch size, and $E$ = d_model.

Positional encoding is not included in this module. The caller is responsible for adding positional information to src and tgt before passing them in. The standard approach is sinusoidal encodings (Vaswani et al.) or learned absolute / rotary position embeddings.
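As an illustration of the sinusoidal option mentioned above, here is a minimal sketch in plain Python. The function name `sinusoidal_positional_encoding` is hypothetical and not part of lucid. It returns a nested list, so converting the table to a tensor and adding it to src and tgt is left to the caller.

```python
import math


def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> list[list[float]]:
    """Return a (seq_len, d_model) table of sinusoidal position encodings.

    Even feature indices hold sin(pos / 10000^(2k / d_model)) and the odd
    index right after holds the matching cos (Vaswani et al., 2017).
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even for paired sin/cos features")
    table = []
    for pos in range(seq_len):
        row = [0.0] * d_model
        for i in range(0, d_model, 2):
            # i is already the even index 2k, so the exponent is i / d_model.
            angle = pos / (10000.0 ** (i / d_model))
            row[i] = math.sin(angle)
            row[i + 1] = math.cos(angle)
        table.append(row)
    return table


pe = sinusoidal_positional_encoding(seq_len=10, d_model=512)
print(len(pe), len(pe[0]))  # 10 512
```

The same table applies to every sequence in the batch, so it is added along the sequence axis and repeated across the batch axis $N$, whichever layout batch_first selects.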

Masking conventions:

src_mask / tgt_mask / memory_mask: attention bias masks of shape (S, S), (T, T), or (T, S) respectively (or (N*H, ...) per-head variants). Boolean True = ignore that position.
*_key_padding_mask: boolean masks of shape (N, S) or (N, T) indicating padding positions (True = pad).
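The two mask kinds can be sketched in plain Python as nested lists of booleans. The helper names `causal_mask` and `key_padding_mask` are hypothetical, and the sketch assumes rows index query positions and columns index key positions, consistent with the (T, S) shape given for memory_mask. Converting the lists to boolean tensors is left to the caller.

```python
def causal_mask(size: int) -> list[list[bool]]:
    """Square (size, size) mask; True marks positions a query must ignore.

    Row t is the query position and column s the key position, so every
    future key (s > t) is masked.
    """
    return [[s > t for s in range(size)] for t in range(size)]


def key_padding_mask(lengths: list[int], max_len: int) -> list[list[bool]]:
    """(N, max_len) mask; True marks padding past each sequence's length."""
    return [[pos >= n for pos in range(max_len)] for n in lengths]


for row in causal_mask(3):
    print(row)
# [False, True, True]
# [False, False, True]
# [False, False, False]

print(key_padding_mask([3, 1], max_len=3))
# [[False, False, False], [False, True, True]]
```

A causal mask of this form is what tgt_mask would typically carry during training and generation, while the padding masks depend on the actual sequence lengths in each batch.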

Initialisation: All projection weights use Xavier uniform initialisation (via MultiheadAttention) and all biases are zeroed. This follows common practice for Transformer implementations and helps keep gradient flow stable at the start of training.

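Inference / generation: For autoregressive decoding, invoke the encoder once to obtain memory, then call the decoder iteratively with a growing tgt tensor and an appropriate causal mask. The last position of each decoder output is the d_model-dimensional representation from which the next token is predicted; mapping it to vocabulary logits requires a separate output projection, which this module does not include.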
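The control flow of that loop can be sketched independently of any tensor library. In the sketch below, `greedy_decode`, `encode`, and `decode_step` are hypothetical names. In a real setup `encode` would presumably wrap the encoder and `decode_step` would run the decoder on the current prefix with a causal mask and pick the next token from the last position. Here they are replaced by toy stand-ins so the example runs on its own.

```python
from typing import Callable, Sequence


def greedy_decode(
    encode: Callable[[Sequence[int]], object],
    decode_step: Callable[[object, list[int]], int],
    src: Sequence[int],
    bos: int,
    eos: int,
    max_len: int,
) -> list[int]:
    """Greedy autoregressive decoding.

    encode(src) is called once and returns the memory.
    decode_step(memory, tgt) sees the whole prefix generated so far and
    returns the id of the next token.
    """
    memory = encode(src)  # the encoder runs exactly once
    tgt = [bos]           # growing target prefix
    while len(tgt) < max_len:
        next_token = decode_step(memory, tgt)
        tgt.append(next_token)
        if next_token == eos:
            break
    return tgt


# Toy stand-ins (hypothetical): the "model" copies src, then emits EOS.
BOS, EOS = 0, 1


def toy_encode(src: Sequence[int]) -> list[int]:
    return list(src)


def toy_step(memory: list[int], tgt: list[int]) -> int:
    step = len(tgt) - 1  # number of tokens generated so far
    return memory[step] if step < len(memory) else EOS


print(greedy_decode(toy_encode, toy_step, [5, 6, 7], BOS, EOS, max_len=10))
# [0, 5, 6, 7, 1]
```

This version re-runs the decoder over the full prefix at every step, which matches the description above but costs quadratic work in the output length. Caching per-layer keys and values is a common optimisation that this sketch does not attempt.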

Examples

**Standard translation model** (default hyperparameters):
>>> import lucid
>>> import lucid.nn as nn
>>> model = nn.Transformer(d_model=512, nhead=8,
...                        num_encoder_layers=6,
...                        num_decoder_layers=6)
>>> src = lucid.randn(10, 2, 512)       # (src_len, batch, d_model)
>>> tgt = lucid.randn(7, 2, 512)        # (tgt_len, batch, d_model)
>>> out = model(src, tgt)
>>> out.shape
(7, 2, 512)

**Compact model with batch_first and GELU**:
>>> model = nn.Transformer(
...     d_model=256, nhead=4,
...     num_encoder_layers=3, num_decoder_layers=3,
...     dim_feedforward=512, dropout=0.0,
...     activation="gelu", batch_first=True,
... )
>>> src = lucid.randn(2, 20, 256)
>>> tgt = lucid.randn(2, 15, 256)
>>> out = model(src, tgt)
>>> out.shape
(2, 15, 256)

Constructors

__init__ → None

__init__(d_model: int = 512, nhead: int = 8, num_encoder_layers: int = 6, num_decoder_layers: int = 6, dim_feedforward: int = 2048, dropout: float = 0.1, activation: str = 'relu', batch_first: bool = False, device: DeviceLike = None, dtype: DTypeLike = None)

Initialise the Transformer module. See the class docstring for parameter semantics.

Instance methods

extra_repr → str

extra_repr()

Return a string representation of the layer's configuration.

forward → Tensor

forward(src: Tensor, tgt: Tensor, src_mask: Tensor | None = None, tgt_mask: Tensor | None = None, memory_mask: Tensor | None = None, src_key_padding_mask: Tensor | None = None, tgt_key_padding_mask: Tensor | None = None, memory_key_padding_mask: Tensor | None = None)

Run the forward pass: encode src into memory, then decode tgt conditioned on that memory.

Parameters

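src: Tensor

Source sequence of shape (S, N, E), or (N, S, E) when batch_first=True.

tgt: Tensor

Target sequence of shape (T, N, E), or (N, T, E) when batch_first=True.

src_mask: Tensor | None = None

Optional attention mask of shape (S, S) for encoder self-attention; True = ignore that position.

tgt_mask: Tensor | None = None

Optional attention mask of shape (T, T) for decoder self-attention, typically a causal mask; True = ignore that position.

memory_mask: Tensor | None = None

Optional attention mask of shape (T, S) for decoder cross-attention over the encoder memory; True = ignore that position.

src_key_padding_mask: Tensor | None = None

Optional boolean mask of shape (N, S) marking padding in src (True = pad).

tgt_key_padding_mask: Tensor | None = None

Optional boolean mask of shape (N, T) marking padding in tgt (True = pad).

memory_key_padding_mask: Tensor | None = None

Optional boolean mask of shape (N, S) marking padding in the encoder memory (True = pad).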

Returns

Tensor

Output tensor of shape (T, N, E), or (N, T, E) when batch_first=True.
