class TransformerEncoderLayer

extends Module
TransformerEncoderLayer(d_model: int, nhead: int, dim_feedforward: int = 2048, dropout: float = 0.1, activation: str = 'relu', batch_first: bool = False, norm_first: bool = False, device: DeviceLike = None, dtype: DTypeLike = None)

Single transformer encoder layer: self-attention followed by a position-wise feed-forward network, with residual connections and layer normalisation.

This is one building block of the encoder stack described in "Attention Is All You Need" (Vaswani et al., 2017). A full encoder is formed by stacking $N$ copies of this layer (see TransformerEncoder).

Post-LN (original paper, default norm_first=False):

$$x = \text{LayerNorm}\!\left(x + \text{Dropout}(\text{SelfAttn}(x))\right)$$

$$x = \text{LayerNorm}\!\left(x + \text{Dropout}(\text{FFN}(x))\right)$$

Normalising after the residual addition places LayerNorm directly on the residual path, so every gradient must pass through each normalisation. This can cause instability at the start of training for very deep models.

Pre-LN (norm_first=True):

$$x = x + \text{Dropout}(\text{SelfAttn}(\text{LayerNorm}(x)))$$

$$x = x + \text{Dropout}(\text{FFN}(\text{LayerNorm}(x)))$$

Normalising only the sub-layer input leaves the residual path as an identity connection, which improves gradient flow and often allows training with little or no learning-rate warm-up. Pre-LN is the default in most modern large-scale transformers (GPT-2, GPT-3, LLaMA, etc.).
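The difference between the two orderings can be sketched in a few lines of NumPy. This is an illustrative sketch, not Lucid's implementation: `layer_norm` has no learnable scale or shift, dropout is omitted (as in evaluation mode), and `toy_sublayer` is a hypothetical stand-in for either SelfAttn or FFN.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """LayerNorm over the last axis, without learnable scale/shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # Post-LN: normalise after the residual addition.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # Pre-LN: normalise only the sub-layer input; the skip path is untouched.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 4, 16))        # (seq, batch, d_model)
W = rng.normal(size=(16, 16)) * 0.1

def toy_sublayer(h):
    # Hypothetical stand-in for SelfAttn or FFN: a single linear map.
    return h @ W

post_out = post_ln_block(x, toy_sublayer)
pre_out = pre_ln_block(x, toy_sublayer)
print(post_out.shape, pre_out.shape)
```

In the Pre-LN output, `pre_out - x` is exactly the sub-layer contribution, which is what keeps the skip connection an identity map.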

Feed-forward network (FFN):

$$\text{FFN}(x) = \text{Linear}_2\!\left(\text{Dropout}(\sigma(\text{Linear}_1(x)))\right)$$

where $\sigma$ is either ReLU or GELU depending on activation. The inner dimension dim_feedforward is typically set to $4 \times d_{\text{model}}$, as in the original paper.
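A minimal NumPy sketch of the FFN formula above, with the inner dropout omitted (as at inference time). The function names and weight layout are illustrative assumptions, and the GELU shown is the exact erf-based form; whether Lucid uses this form or the tanh approximation is not stated on this page.

```python
import math
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    erf = np.vectorize(math.erf)
    return 0.5 * x * (1.0 + erf(x / math.sqrt(2.0)))

def ffn(x, W1, b1, W2, b2, activation="relu"):
    act = {"relu": relu, "gelu": gelu}[activation]
    hidden = act(x @ W1 + b1)   # Linear_1 then sigma: (..., dim_feedforward)
    return hidden @ W2 + b2     # Linear_2: back to (..., d_model)

d_model, dim_feedforward = 8, 32        # inner width is 4 x d_model
rng = np.random.default_rng(0)
W1 = rng.normal(size=(d_model, dim_feedforward)) * 0.1
b1 = np.zeros(dim_feedforward)
W2 = rng.normal(size=(dim_feedforward, d_model)) * 0.1
b2 = np.zeros(d_model)

x = rng.normal(size=(5, 2, d_model))    # (seq, batch, d_model)
y = ffn(x, W1, b1, W2, b2, activation="gelu")
print(y.shape)
```

The FFN acts on each position independently, so the leading (seq, batch) axes pass through unchanged.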

Parameters

d_model: int
Dimensionality of the model's hidden representations, $d_{\text{model}}$. All sub-layers produce outputs of this width.
nhead: int
Number of attention heads in the self-attention sub-layer. Must divide d_model evenly.
dim_feedforward: int = 2048
Inner (hidden) width of the two-layer FFN. Default: 2048.
dropout: float = 0.1
Dropout probability applied to each sub-layer output and inside the FFN after the activation. Default: 0.1.
activation: str = 'relu'
Non-linearity used inside the FFN. Supported values: "relu" (default) and "gelu".
batch_first: bool = False
If True, inputs and outputs are (batch, seq, feature). If False (default), they are (seq, batch, feature).
norm_first: bool = False
If True, applies Pre-LN (layer norm before each sub-layer). If False (default), applies Post-LN (layer norm after the residual addition).
device: DeviceLike = None
Device for all sub-module parameters.
dtype: DTypeLike = None
Data type for all sub-module parameters.
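The divisibility requirement on nhead follows from splitting d_model evenly across heads. The helper below is a hypothetical illustration of that constraint, not part of Lucid's API; how Lucid itself reports an invalid combination is not documented here.

```python
def head_dim(d_model: int, nhead: int) -> int:
    """Per-head width, assuming d_model is split evenly across heads.

    Hypothetical helper for illustration; not a Lucid function.
    """
    if d_model % nhead != 0:
        raise ValueError(f"d_model={d_model} is not divisible by nhead={nhead}")
    return d_model // nhead

print(head_dim(512, 8))   # 64
print(head_dim(256, 4))   # 64
```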

Attributes

self_attn: MultiheadAttention
Multi-head self-attention sub-layer.
linear1: Linear
First linear layer of the FFN: d_model → dim_feedforward.
linear2: Linear
Second linear layer of the FFN: dim_feedforward → d_model.
norm1: LayerNorm
Layer normalisation applied around the self-attention sub-layer.
norm2: LayerNorm
Layer normalisation applied around the FFN sub-layer.
dropout1: Dropout
Dropout applied to the self-attention output before the residual addition.
dropout2: Dropout
Dropout applied inside the FFN (after the activation).
dropout3: Dropout
Dropout applied to the FFN output before the residual addition.
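As a rough guide to the layer's size, the learnable parameters of these sub-modules can be tallied by hand. The sketch assumes that self_attn holds four d_model × d_model projections (query, key, value, output) with biases, that both Linear layers have biases, and that each LayerNorm has a scale and a shift vector. These are common conventions rather than facts stated on this page, and the helper name is hypothetical.

```python
def encoder_layer_param_count(d_model: int, dim_feedforward: int = 2048) -> dict:
    """Estimate parameter counts per sub-module (hypothetical helper)."""
    counts = {
        # Assumed: Q, K, V and output projections, each with a bias.
        "self_attn": 4 * (d_model * d_model + d_model),
        "linear1": d_model * dim_feedforward + dim_feedforward,
        "linear2": dim_feedforward * d_model + d_model,
        # norm1 and norm2: one scale and one shift vector each.
        "norm1+norm2": 2 * (2 * d_model),
    }
    counts["total"] = sum(counts.values())
    return counts

counts = encoder_layer_param_count(d_model=512, dim_feedforward=2048)
print(counts["total"])   # 3152384
```

Dropout modules carry no parameters, so they do not appear in the tally.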

Notes

  • Input src: $(S, N, E)$ when batch_first=False, or $(N, S, E)$ when batch_first=True.
  • Output: same shape as src.

where $S$ is the source sequence length, $N$ is the batch size, and $E$ = d_model.
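The two layouts differ only in the order of the first two axes. The NumPy sketch below (plain arrays, not Lucid tensors) shows how data in one layout maps to the other.

```python
import numpy as np

S, N, E = 20, 4, 512                      # sequence length, batch size, d_model

src_seq_first = np.zeros((S, N, E))       # layout expected when batch_first=False
# Swap the first two axes to get the batch_first=True layout.
src_batch_first = np.transpose(src_seq_first, (1, 0, 2))

print(src_seq_first.shape)    # (20, 4, 512)
print(src_batch_first.shape)  # (4, 20, 512)
```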

The three Dropout modules (dropout1, dropout2, dropout3) are each initialised with the same probability but are distinct instances. Each application site therefore has its own module, samples its own mask, and can be inspected or replaced independently of the others.

For inference, call model.eval() to disable all three dropout layers simultaneously through Lucid's Module.training flag.
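The effect of the training flag on a dropout layer can be illustrated with a small NumPy sketch. It assumes the common "inverted dropout" formulation, in which surviving activations are scaled by 1 / (1 - p) during training; this page does not confirm that Lucid's Dropout uses this scaling, and the function below is a hypothetical stand-in.

```python
import numpy as np

def dropout(x, p, training, rng):
    """Inverted dropout (assumed formulation, not Lucid's code)."""
    if not training or p == 0.0:
        return x                          # evaluation mode: identity
    keep = rng.random(x.shape) >= p       # keep each element with probability 1 - p
    return x * keep / (1.0 - p)           # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
x = np.ones(1000)

train_out = dropout(x, p=0.1, training=True, rng=rng)
eval_out = dropout(x, p=0.1, training=False, rng=rng)

print(np.unique(train_out))               # zeros and values near 1 / 0.9
print(np.array_equal(eval_out, x))        # True
```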

Examples

**Basic Post-LN encoder layer** (default):
>>> import lucid
>>> import lucid.nn as nn
>>> layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
>>> # Sequence-first layout: (seq_len, batch, d_model)
>>> src = lucid.randn(20, 4, 512)
>>> out = layer(src)
>>> out.shape
(20, 4, 512)

**Pre-LN variant with GELU and batch_first layout**:
>>> layer = nn.TransformerEncoderLayer(
...     d_model=256,
...     nhead=4,
...     dim_feedforward=1024,
...     dropout=0.0,
...     activation="gelu",
...     batch_first=True,
...     norm_first=True,
... )
>>> src = lucid.randn(2, 15, 256)       # (batch, seq, d_model)
>>> out = layer(src)
>>> out.shape
(2, 15, 256)

Methods (3)

dunder

__init__

None
__init__(d_model: int, nhead: int, dim_feedforward: int = 2048, dropout: float = 0.1, activation: str = 'relu', batch_first: bool = False, norm_first: bool = False, device: DeviceLike = None, dtype: DTypeLike = None)

Initialise the TransformerEncoderLayer module. See the class docstring for parameter semantics.

fn

forward

Tensor
forward(src: Tensor, src_mask: Tensor | None = None, src_key_padding_mask: Tensor | None = None, is_causal: bool = False)

Run the forward pass of the module.

Parameters

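src: Tensor
Input sequence. See the Notes in the class docstring for the expected shape.
src_mask: Tensor | None = None
Optional attention mask for the self-attention sub-layer.
src_key_padding_mask: Tensor | None = None
Optional mask marking padded positions in each source sequence.
is_causal: bool = False
Whether causal masking is requested for the self-attention sub-layer.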

Returns

Tensor

Output tensor with the same shape as src.
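This page does not state the mask convention that Lucid expects. The sketch below builds the two usual mask shapes with NumPy, assuming boolean masks in which True marks a blocked position (the convention used by PyTorch's equivalent layer). Check Lucid's MultiheadAttention documentation before passing such masks to forward.

```python
import numpy as np

S, N = 5, 2                                # sequence length, batch size

# Causal mask of shape (S, S): position i may not attend to positions j > i.
# Assumed convention: True means "blocked".
causal_mask = np.triu(np.ones((S, S), dtype=bool), k=1)

# Key padding mask of shape (N, S): True marks padding tokens.
lengths = np.array([5, 3])                 # hypothetical true lengths per sequence
key_padding_mask = np.arange(S)[None, :] >= lengths[:, None]

print(causal_mask.astype(int))
print(key_padding_mask.astype(int))
```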

fn

extra_repr

str
extra_repr()

Return a string representation of the layer's configuration.