class TransformerEncoder extends Module

TransformerEncoder(encoder_layer: TransformerEncoderLayer, num_layers: int, norm: Module | None = None)


A stack of $N$ identical TransformerEncoderLayer modules.

The encoder maps a source sequence of continuous representations $(x_1, \ldots, x_S)$ into another sequence of the same shape. Each layer refines the representation by attending to all positions simultaneously (self-attention has no ordering constraint), allowing every position to gather context from the entire sequence in a single forward pass.

Formally, letting $\text{Layer}_i$ denote the $i$-th encoder layer:

$$h^{(0)} = \text{src}$$

$$h^{(i)} = \text{Layer}_i\!\left(h^{(i-1)},\; \text{mask},\; \text{src\_key\_padding\_mask}\right), \quad i = 1, \ldots, N$$

$$\text{output} = \begin{cases} \text{norm}\!\left(h^{(N)}\right) & \text{if norm is set} \\ h^{(N)} & \text{otherwise} \end{cases}$$
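The recurrence above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: `encoder_forward`, the `Layer` alias, and `add_one` are hypothetical names, and a "layer" is taken to be any callable accepting the hidden state and the two masks.

```python
from typing import Callable, Optional, Sequence

# Assumption: a "layer" is any callable taking
# (hidden, mask, src_key_padding_mask). This stand-in signature is for
# illustration only.
Layer = Callable[[list, Optional[list], Optional[list]], list]


def encoder_forward(
    src: list,
    layers: Sequence[Layer],
    norm: Optional[Callable[[list], list]] = None,
    mask: Optional[list] = None,
    src_key_padding_mask: Optional[list] = None,
) -> list:
    """Apply each layer in order, then the optional final norm."""
    h = src  # h^(0) = src
    for layer in layers:
        # h^(i) = Layer_i(h^(i-1), mask, src_key_padding_mask)
        h = layer(h, mask, src_key_padding_mask)
    return norm(h) if norm is not None else h


def add_one(h, mask, src_key_padding_mask):
    """Toy stateless layer: adds 1 to every element and ignores the masks."""
    return [x + 1 for x in h]


# Three toy layers, without and with a final "norm" (here just a halving).
result = encoder_forward([0.0, 1.0], [add_one] * 3)
normed = encoder_forward([0.0, 1.0], [add_one] * 3,
                         norm=lambda h: [x / 2 for x in h])
```

The same two masks are handed to every layer without modification, and the final normalisation runs once, after the last layer.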

Parameters

encoder_layer: TransformerEncoderLayer

A single configured encoder layer whose hyperparameters (d_model, nhead, dim_feedforward, dropout, activation, batch_first, norm_first) are used to construct num_layers independent copies. The provided instance itself is not reused as one of the copies; a fresh layer is always instantiated per index.
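The copy behaviour described here can be illustrated with a toy stand-in. `ToyEncoderLayer` and `build_stack` are hypothetical names, and the toy layer holds only hyperparameters (no weights, and its defaults are arbitrary); the sketch shows the idea of reading the configuration off a template and constructing a fresh instance per index, not how the library does it internally.

```python
from dataclasses import dataclass, asdict


@dataclass
class ToyEncoderLayer:
    """Hypothetical stand-in holding only hyperparameters (no weights)."""
    d_model: int
    nhead: int
    dim_feedforward: int = 2048
    dropout: float = 0.1


def build_stack(template: ToyEncoderLayer, num_layers: int) -> list:
    # Read the hyperparameters off the template and construct a fresh
    # instance per index; the template itself is never placed in the list.
    config = asdict(template)
    return [ToyEncoderLayer(**config) for _ in range(num_layers)]


template = ToyEncoderLayer(d_model=512, nhead=8)
stack = build_stack(template, num_layers=6)
```

Each entry in `stack` has the same configuration as `template` but is a distinct object, so in a real module every layer would own its own parameters.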

num_layers: int

Number of encoder layers $N$ to stack.

norm: Module or None = None

Optional final normalisation module applied to the output of the last layer. Commonly a LayerNorm of size d_model. Default: None.

Attributes

layers: list[TransformerEncoderLayer]

The $N$ encoder layer instances, also registered as sub-modules "0", "1", …, "N-1" for proper parameter tracking and serialisation.

norm: Module or None

The optional post-stack normalisation module.

num_layers: int

The number of stacked layers.

Notes

Input src: $(S, N, E)$ when batch_first=False, or $(N, S, E)$ when batch_first=True.
Output: same shape as src.

Here $S$ is the source sequence length, $N$ is the batch size (not the number of layers, which the formal definition above also writes as $N$), and $E$ = d_model.

The mask and src_key_padding_mask arguments are propagated unchanged to every layer in the stack. A boolean mask follows the convention that True indicates positions to ignore (mask out), matching the additive -inf semantics used internally.
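To make the boolean convention concrete, the sketch below converts a boolean mask to its additive form and applies it to a row of attention scores. It is plain Python written for illustration; `bool_mask_to_additive` and `masked_softmax` are hypothetical helpers, not functions from the library.

```python
import math


def bool_mask_to_additive(mask: list) -> list:
    """Map True (ignore) to -inf and False (keep) to 0.0."""
    return [-math.inf if ignore else 0.0 for ignore in mask]


def masked_softmax(scores: list, mask: list) -> list:
    """Softmax over scores after adding the additive mask."""
    shifted = [s + m for s, m in zip(scores, bool_mask_to_additive(mask))]
    peak = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in shifted]  # exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]


# The last key position is masked, so it receives zero attention weight
# even though its raw score is the largest.
weights = masked_softmax([1.0, 1.0, 5.0], [False, False, True])
```

A row in which every position is masked would produce NaN in this sketch, since the maximum itself is -inf; at least one position per row should remain unmasked.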

When norm is provided it is registered as a sub-module named "norm", so its parameters appear in state_dict() and are updated by the optimiser.

Examples

**Six-layer encoder with final LayerNorm** (standard transformer):
>>> import lucid
>>> import lucid.nn as nn
>>> layer = nn.TransformerEncoderLayer(d_model=512, nhead=8,
...                                    dim_feedforward=2048)
>>> norm = nn.LayerNorm(512)
>>> encoder = nn.TransformerEncoder(layer, num_layers=6, norm=norm)
>>> src = lucid.randn(30, 4, 512)       # (seq, batch, d_model)
>>> memory = encoder(src)
>>> memory.shape
(30, 4, 512)

**Encoder with source padding mask** (variable-length sequences):
>>> encoder = nn.TransformerEncoder(
...     nn.TransformerEncoderLayer(d_model=128, nhead=4,
...                                batch_first=True),
...     num_layers=3,
... )
>>> src = lucid.randn(2, 10, 128)
>>> # True = padding position to be ignored
>>> pad_mask = lucid.tensor([[False] * 7 + [True] * 3,
...                          [False] * 10])
>>> out = encoder(src, src_key_padding_mask=pad_mask)
>>> out.shape
(2, 10, 128)
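The padding mask in the second example is written out by hand. When only the true sequence lengths are known, a mask with the same True-means-padding convention could be built as follows; `padding_mask_from_lengths` is a hypothetical helper, shown in plain Python so that it runs without the library.

```python
def padding_mask_from_lengths(lengths: list, max_len: int) -> list:
    """Return a (batch, max_len) nested list; True marks padding."""
    return [[pos >= n for pos in range(max_len)] for n in lengths]


# Two sequences of true length 7 and 10, padded to length 10.
pad_mask = padding_mask_from_lengths([7, 10], max_len=10)
```

The resulting nested list matches the literal passed to lucid.tensor in the example above.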


Constructors

__init__(encoder_layer: TransformerEncoderLayer, num_layers: int, norm: Module | None = None) → None


Initialise the TransformerEncoder module. See the class docstring for parameter semantics.

Instance methods

extra_repr() → str


Return a string representation of the module's configuration.

forward(src: Tensor, mask: Tensor | None = None, src_key_padding_mask: Tensor | None = None) → Tensor


Pass src through each encoder layer in order, then apply norm if it is set.

Parameters

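src: Tensor

Source sequence of shape $(S, N, E)$ when batch_first=False, or $(N, S, E)$ when batch_first=True.

mask: Tensor | None = None

Optional attention mask, propagated unchanged to every layer. In a boolean mask, True marks positions to ignore.

src_key_padding_mask: Tensor | None = None

Optional padding mask, propagated unchanged to every layer. In a boolean mask, True marks padding positions to ignore.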

Returns

Tensor

Output tensor with the same shape as src.
