TransformerEncoder
Module
TransformerEncoder(encoder_layer: TransformerEncoderLayer, num_layers: int, norm: Module | None = None)
A stack of identical TransformerEncoderLayer modules.
The encoder maps a source sequence of continuous representations into another sequence of the same shape. Each layer refines the representation by attending to all positions simultaneously (self-attention has no ordering constraint), allowing every position to gather context from the entire sequence in a single forward pass.
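Formally, letting $\mathrm{Layer}_i$ denote the $i$-th encoder layer:
$$x_0 = \mathrm{src}, \qquad x_i = \mathrm{Layer}_i(x_{i-1}) \quad \text{for } i = 1, \dots, \mathrm{num\_layers}.$$
The output is the final $x_i$, passed through norm when one is provided.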
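The layer-by-layer composition can be sketched as a plain loop. This is a minimal illustration, not the library's implementation: `run_encoder_stack` and the toy callables are hypothetical names, the layers are ordinary Python functions rather than real encoder layers, and the mask arguments are omitted for brevity.

```python
def run_encoder_stack(src, layers, norm=None):
    """Apply each layer in order, then the optional final norm.

    Hypothetical helper: `layers` is any sequence of callables.
    """
    x = src
    for layer in layers:
        x = layer(x)  # each layer consumes the previous layer's output
    return norm(x) if norm is not None else x


# Toy stand-ins: each "layer" adds 1, the "norm" doubles the result.
toy_layers = [lambda x: x + 1 for _ in range(3)]
result = run_encoder_stack(0, toy_layers, norm=lambda x: x * 2)
print(result)  # 6
```

With three add-one layers and a doubling norm, the input 0 becomes 3 after the stack and 6 after the norm.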
Parameters
encoder_layer : TransformerEncoderLayer
    Template layer. Its hyperparameters (d_model, nhead, dim_feedforward, dropout,
    activation, batch_first, norm_first) are used
    to construct num_layers independent copies. The provided
    instance itself is not reused as one of the copies; a
    fresh layer is always instantiated per index.
num_layers : int
    Number of encoder layers in the stack.
norm : Module or None = None
    Optional final normalisation module, typically a LayerNorm of size
    d_model. Default: None.
Attributes
layers : list[TransformerEncoderLayer]
    The encoder layers, registered under the names "0", "1", …, "N-1"
    (N = num_layers) for proper
    parameter tracking and serialisation.
norm : Module or None
num_layers : int
Notes
- Input src: (S, B, E) when batch_first=False, or (B, S, E) when batch_first=True.
- Output: same shape as src.
where S is the source sequence length, B is the
batch size, and E = d_model.
The mask and src_key_padding_mask arguments are propagated
unchanged to every layer in the stack. A boolean mask follows the
convention that True indicates positions to ignore (mask
out), matching the additive -inf semantics used internally.
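The link between a boolean mask and additive -inf semantics can be illustrated with a small pure-Python sketch. The helper names `to_additive` and `masked_softmax` are hypothetical and the code works on plain lists, not tensors; it only shows why a True entry ends up with zero attention weight.

```python
import math


def to_additive(bool_mask):
    """Map True (ignore) to -inf and False (keep) to 0.0."""
    return [-math.inf if ignore else 0.0 for ignore in bool_mask]


def masked_softmax(scores, bool_mask):
    """Softmax over one row of attention scores after adding the mask."""
    shifted = [s + m for s, m in zip(scores, to_additive(bool_mask))]
    peak = max(shifted)  # assumes at least one position is kept
    exps = [math.exp(v - peak) for v in shifted]  # exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]


# The third position is marked True, so it receives no attention weight.
weights = masked_softmax([1.0, 2.0, 3.0], [False, False, True])
print(weights[2])  # 0.0
```

The remaining weights are renormalised over the kept positions, so they still sum to 1.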
When norm is provided it is registered as a sub-module named
"norm", so its parameters appear in state_dict() and are
updated by the optimiser.
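The construction and naming rules described in this section can be sketched with toy objects. Everything here is an assumption made for illustration: `ToyLayer` and `build_stack` are hypothetical, only two hyperparameters are copied, and the flat dictionary simplifies the real layout, where the layers sit under the layers attribute.

```python
class ToyLayer:
    """Hypothetical stand-in for an encoder layer; stores hyperparameters only."""

    def __init__(self, d_model, nhead):
        self.d_model = d_model
        self.nhead = nhead


def build_stack(template, num_layers, norm=None):
    """Return a name -> sub-module mapping for a toy encoder."""
    hparams = {"d_model": template.d_model, "nhead": template.nhead}
    # A fresh layer per index; the template itself is never reused.
    modules = {str(i): ToyLayer(**hparams) for i in range(num_layers)}
    if norm is not None:
        modules["norm"] = norm  # registered under the name "norm"
    return modules


template = ToyLayer(d_model=512, nhead=8)
stack = build_stack(template, num_layers=3, norm=object())
print(list(stack))  # ['0', '1', '2', 'norm']
```

Because every copy is a distinct object, updating one layer's parameters during training cannot affect the others or the template.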
Examples
**Six-layer encoder with final LayerNorm** (standard transformer):
>>> import lucid
>>> import lucid.nn as nn
>>> layer = nn.TransformerEncoderLayer(d_model=512, nhead=8,
...                                    dim_feedforward=2048)
>>> norm = nn.LayerNorm(512)
>>> encoder = nn.TransformerEncoder(layer, num_layers=6, norm=norm)
>>> src = lucid.randn(30, 4, 512) # (seq, batch, d_model)
>>> memory = encoder(src)
>>> memory.shape
(30, 4, 512)
**Encoder with source padding mask** (variable-length sequences):
>>> encoder = nn.TransformerEncoder(
...     nn.TransformerEncoderLayer(d_model=128, nhead=4,
...                                batch_first=True),
...     num_layers=3,
... )
>>> src = lucid.randn(2, 10, 128)  # (batch, seq, d_model)
>>> # True = padding position to be ignored
>>> pad_mask = lucid.tensor([[False] * 7 + [True] * 3,
...                          [False] * 10])
>>> out = encoder(src, src_key_padding_mask=pad_mask)
>>> out.shape
(2, 10, 128)
Methods (3)
__init__
__init__(encoder_layer: TransformerEncoderLayer, num_layers: int, norm: Module | None = None) → None
Initialise the TransformerEncoder module. See the class docstring for parameter semantics.
forward
forward(src: Tensor, mask: Tensor | None = None, src_key_padding_mask: Tensor | None = None) → Tensor
Run the forward pass of the module.
Parameters
src : Tensor
mask : Tensor or None = None
src_key_padding_mask : Tensor or None = None
Returns
Tensor
    Output tensor; refer to the class docstring for the exact shape.
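For variable-length batches, the boolean values for src_key_padding_mask can be derived from the sequence lengths. The sketch below uses plain nested lists and a hypothetical helper name, `padding_mask`; wrapping the result in a tensor (as the class-level example does with lucid.tensor) is left to the caller.

```python
def padding_mask(lengths, max_len):
    """Build a (batch, seq) boolean mask where True marks padding.

    Hypothetical helper: position `pos` of a sequence with `n` real
    tokens is padding whenever pos >= n.
    """
    return [[pos >= n for pos in range(max_len)] for n in lengths]


# Two sequences padded to length 10: the first has 7 real tokens.
mask = padding_mask([7, 10], max_len=10)
print(mask[0])
```

This reproduces the mask written out by hand in the padding example: seven False entries followed by three True entries for the first sequence, and all False for the second.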
extra_repr
extra_repr() → str
Return a string representation of the layer's configuration.