class TransformerEncoderLayer

extends Module

TransformerEncoderLayer(d_model: int, nhead: int, dim_feedforward: int = 2048, dropout: float = 0.1, activation: str = 'relu', batch_first: bool = False, norm_first: bool = False, device: DeviceLike = None, dtype: DTypeLike = None)


Single transformer encoder layer: self-attention followed by a position-wise feed-forward network, with residual connections and layer normalisation.

This is one building block of the encoder stack described in "Attention Is All You Need" (Vaswani et al., 2017). A full encoder is formed by stacking $N$ copies of this layer (see TransformerEncoder).

Post-LN (original paper, default norm_first=False):

x = \text{LayerNorm}\!\left(x + \text{Dropout}(\text{SelfAttn}(x))\right)

x = \text{LayerNorm}\!\left(x + \text{Dropout}(\text{FFN}(x))\right)

Because normalisation follows the residual addition, LayerNorm sits directly on the residual path and every gradient must pass through it, which can cause instability at the start of training for very deep models.

Pre-LN (norm_first=True):

x = x + \text{Dropout}(\text{SelfAttn}(\text{LayerNorm}(x)))

x = x + \text{Dropout}(\text{FFN}(\text{LayerNorm}(x)))

Normalising before the sub-layer leaves the residual stream as an identity path from input to output, which improves gradient flow and often allows training with little or no learning-rate warm-up. Pre-LN is the default in most modern large-scale transformers (GPT-2, GPT-3, LLaMA, etc.).
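The difference between the two orderings can be sketched in plain Python on a single feature vector. This is only an illustration under stated assumptions: `layer_norm` has no learned scale or shift, `toy_sublayer` is a hypothetical stand-in for SelfAttn or FFN, and dropout is omitted (as in eval mode). It is not Lucid's implementation.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalise one feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for SelfAttn or FFN: a fixed elementwise map."""
    return [0.5 * v for v in x]

def post_ln_step(x, sublayer):
    # Post-LN: x = LayerNorm(x + sublayer(x))
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln_step(x, sublayer):
    # Pre-LN: x = x + sublayer(LayerNorm(x))
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

x = [1.0, 2.0, 3.0, 10.0]
post = post_ln_step(x, toy_sublayer)  # output is re-normalised
pre = pre_ln_step(x, toy_sublayer)    # input x passes through unchanged, plus a correction
```

In the Post-LN result the input's scale is gone, because the whole sum was normalised. In the Pre-LN result, `pre[i] - x[i]` is exactly the sub-layer's contribution, so the input survives on the identity path.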

Feed-forward network (FFN):

\text{FFN}(x) = \text{Linear}_2\!\left( \text{Dropout}(\sigma(\text{Linear}_1(x))) \right)

where $\sigma$ is either ReLU or GELU depending on activation. The inner dimension dim_feedforward is typically set to $4 \times d_{\text{model}}$ as in the original paper.
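As a minimal sketch of the FFN formula above, the following pure-Python version applies two linear maps with a ReLU or GELU in between. The helper names and the toy weights are hypothetical, the GELU shown is the exact erf-based form (which variant the library uses is not stated here), and the inner dropout is left out, as it would be in eval mode.

```python
import math

def relu(v):
    return max(0.0, v)

def gelu(v):
    # Exact erf-based GELU; the tanh approximation is another common choice.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def linear(x, weight, bias):
    """y = W x + b for one feature vector; weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def ffn(x, w1, b1, w2, b2, activation="relu"):
    act = {"relu": relu, "gelu": gelu}[activation]
    hidden = [act(h) for h in linear(x, w1, b1)]  # d_model -> dim_feedforward
    # Dropout on `hidden` is omitted here (identity in eval mode).
    return linear(hidden, w2, b2)                 # dim_feedforward -> d_model

# Toy sizes: d_model = 2, dim_feedforward = 4, with hand-picked weights.
w1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
b2 = [0.0, 0.0]

y_relu = ffn([3.0, -2.0], w1, b1, w2, b2, activation="relu")
y_gelu = ffn([3.0, -2.0], w1, b1, w2, b2, activation="gelu")
```

The weights are chosen so that the FFN computes act(x) - act(-x), which equals x for both ReLU and GELU, so the output width and values are easy to check by hand.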

Parameters

d_model: int

Dimensionality of the model's hidden representations, $d_{\text{model}}$. All sub-layers produce outputs of this width.

nhead: int

Number of attention heads in the self-attention sub-layer. Must divide d_model evenly.

dim_feedforward: int = 2048

Inner (hidden) width of the two-layer FFN. Default: 2048.

dropout: float = 0.1

Dropout probability applied to each sub-layer output and inside the FFN, after the activation. Default: 0.1.

activation: str = 'relu'

Non-linearity used inside the FFN. Supported values: "relu" (default) and "gelu".

batch_first: bool = False

If True, inputs and outputs are (batch, seq, feature). If False (default), they are (seq, batch, feature).

norm_first: bool = False

If True, applies Pre-LN (layer norm before each sub-layer). If False (default), applies Post-LN (layer norm after the residual addition).

device: DeviceLike = None

Device for all sub-module parameters.

dtype: DTypeLike = None

Data type for all sub-module parameters.
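The divisibility requirement on nhead follows from splitting the d_model features evenly across heads. A small sketch of that check is shown below; `head_dim` is a hypothetical helper, and the exact error Lucid raises for an invalid combination is not documented here.

```python
def head_dim(d_model: int, nhead: int) -> int:
    """Return the per-head width, or raise if nhead does not divide d_model."""
    if d_model % nhead != 0:
        raise ValueError(f"d_model ({d_model}) must be divisible by nhead ({nhead})")
    return d_model // nhead

per_head = head_dim(512, 8)  # each of the 8 heads works on 64 features
```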

Attributes

self_attn: MultiheadAttention

Multi-head self-attention sub-layer.

linear1: Linear

First linear layer of the FFN: d_model → dim_feedforward.

linear2: Linear

Second linear layer of the FFN: dim_feedforward → d_model.

norm1: LayerNorm

Layer normalisation for the self-attention sub-layer: applied before it when norm_first=True, or after the residual addition when norm_first=False.

norm2: LayerNorm

Layer normalisation for the FFN sub-layer: applied before it when norm_first=True, or after the residual addition when norm_first=False.

dropout1: Dropout

Dropout applied to the self-attention output before the residual addition.

dropout2: Dropout

Dropout applied inside the FFN (after the activation).

dropout3: Dropout

Dropout applied to the FFN output before the residual addition.
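To get a feel for how the attributes above contribute to model size, the learnable parameters can be tallied by hand. The sketch below is a rough estimate under assumptions that this page does not confirm: the attention sub-layer holds four d_model × d_model projections (query, key, value, output), every linear map has a bias, and each LayerNorm has a scale and a shift vector. `encoder_layer_param_count` is a hypothetical helper, not a Lucid function.

```python
def encoder_layer_param_count(d_model: int, dim_feedforward: int = 2048) -> int:
    # Assumption: query, key, value and output projections, each d_model x d_model with bias.
    attention = 4 * (d_model * d_model + d_model)
    # linear1: d_model -> dim_feedforward, linear2: dim_feedforward -> d_model, both with bias.
    ffn = (d_model * dim_feedforward + dim_feedforward) + (dim_feedforward * d_model + d_model)
    # Assumption: norm1 and norm2 each hold a scale and a shift vector of length d_model.
    norms = 2 * (2 * d_model)
    return attention + ffn + norms

total = encoder_layer_param_count(512, 2048)  # 3152384 under the assumptions above
```

Under these assumptions the FFN accounts for roughly two thirds of the layer's parameters when dim_feedforward is 4 × d_model. Dropout modules add no parameters.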

Notes

Input src: $(S, N, E)$ when batch_first=False, or $(N, S, E)$ when batch_first=True.
Output: same shape as src.

Here $S$ is the source sequence length, $N$ is the batch size, and $E$ is d_model.
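The two layouts hold the same data with the first two axes swapped. The following sketch illustrates this with nested Python lists rather than Lucid tensors; `swap_seq_and_batch` and `shape` are hypothetical helpers written for this example.

```python
def swap_seq_and_batch(t):
    """Swap the first two axes of a nested list: (S, N, E) <-> (N, S, E)."""
    return [list(group) for group in zip(*t)]

def shape(t):
    """Shape of a rectangular 3-level nested list."""
    return (len(t), len(t[0]), len(t[0][0]))

S, N, E = 3, 2, 4
# Sequence-first data: element [s][n] is the feature vector of token s in sample n.
seq_first = [[[float(s * 100 + n * 10 + e) for e in range(E)]
              for n in range(N)] for s in range(S)]
batch_first_view = swap_seq_and_batch(seq_first)

print(shape(seq_first))         # (3, 2, 4)
print(shape(batch_first_view))  # (2, 3, 4)
```

After the swap, `batch_first_view[n][s]` is the same feature vector as `seq_first[s][n]`.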

The three Dropout modules (dropout1, dropout2, dropout3) are each initialised with the same probability but are distinct instances. Each one samples its own mask at its own point in the layer, so the attention output, the FFN hidden activations, and the FFN output are regularised independently.

For inference, call model.eval() to disable all three dropout layers simultaneously through Lucid's Module.training flag.
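The effect of the training flag on dropout can be sketched with a toy module. `ToyDropout` is a hypothetical stand-in, not Lucid's Dropout, and it assumes the common inverted-dropout convention in which surviving values are scaled by 1 / (1 - p) during training.

```python
import random

class ToyDropout:
    """Hypothetical stand-in for a Dropout module; not Lucid's implementation."""

    def __init__(self, p: float):
        self.p = p
        self.training = True  # plays the role of the Module.training flag

    def eval(self):
        self.training = False
        return self

    def __call__(self, x, rng):
        if not self.training or self.p == 0.0:
            return list(x)  # identity at inference time
        keep = 1.0 - self.p
        # Inverted dropout (assumed): survivors are scaled by 1 / keep.
        return [v / keep if rng.random() < keep else 0.0 for v in x]

rng = random.Random(0)
drop = ToyDropout(p=0.5)
x = [1.0, 2.0, 3.0, 4.0]
train_out = drop(x, rng)        # each entry is either 0.0 or twice the input
eval_out = drop.eval()(x, rng)  # unchanged copy of x
```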

Examples

**Basic Post-LN encoder layer** (default):
>>> import lucid
>>> import lucid.nn as nn
>>> layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
>>> # Sequence-first layout: (seq_len, batch, d_model)
>>> src = lucid.randn(20, 4, 512)
>>> out = layer(src)
>>> out.shape
(20, 4, 512)

**Pre-LN variant with GELU and batch_first layout**:
>>> layer = nn.TransformerEncoderLayer(
...     d_model=256,
...     nhead=4,
...     dim_feedforward=1024,
...     dropout=0.0,
...     activation="gelu",
...     batch_first=True,
...     norm_first=True,
... )
>>> src = lucid.randn(2, 15, 256)       # (batch, seq, d_model)
>>> out = layer(src)
>>> out.shape
(2, 15, 256)


Constructors

__init__ → None

__init__(d_model: int, nhead: int, dim_feedforward: int = 2048, dropout: float = 0.1, activation: str = 'relu', batch_first: bool = False, norm_first: bool = False, device: DeviceLike = None, dtype: DTypeLike = None)


Initialise the TransformerEncoderLayer module. See the class docstring for parameter semantics.

Instance methods

extra_repr → str

extra_repr()


Return a string representation of the layer's configuration.

forward → Tensor

forward(src: Tensor, src_mask: Tensor | None = None, src_key_padding_mask: Tensor | None = None, is_causal: bool = False)


Run the forward pass: apply the self-attention and FFN sub-layers to src, in the order selected by norm_first.

Parameters

src: Tensor

See the class docstring.

src_mask: Tensor | None = None

See the class docstring.

src_key_padding_mask: Tensor | None = None

See the class docstring.

is_causal: bool = False

See the class docstring.

Returns

Tensor

Output tensor; refer to the class docstring for the exact shape.
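For readers unfamiliar with attention masking, the sketch below shows what a causal mask does to one row of attention scores. It assumes an additive float convention (0.0 where attention is allowed, negative infinity where it is blocked). Whether Lucid's src_mask uses this convention or a boolean one is not stated on this page, and `causal_mask` and `masked_softmax` are hypothetical helpers.

```python
import math

def causal_mask(size):
    """Additive mask: 0.0 for current and past positions, -inf for future ones."""
    return [[0.0 if j <= i else -math.inf for j in range(size)] for i in range(size)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of scores after adding the mask row."""
    shifted = [s + m for s, m in zip(scores, mask_row)]
    peak = max(shifted)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(3)
# Position 1 may attend to positions 0 and 1, but not to position 2.
weights = masked_softmax([1.0, 1.0, 1.0], mask[1])
print(weights)  # [0.5, 0.5, 0.0]
```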
