TransformerEncoderLayer
Module TransformerEncoderLayer(d_model: int, nhead: int, dim_feedforward: int = 2048, dropout: float = 0.1, activation: str = 'relu', batch_first: bool = False, norm_first: bool = False, device: DeviceLike = None, dtype: DTypeLike = None)
Single transformer encoder layer: self-attention followed by a position-wise feed-forward network, with residual connections and layer normalisation.
This is one building block of the encoder stack described in
"Attention Is All You Need" (Vaswani et al., 2017). A full encoder
is formed by stacking copies of this layer (see
TransformerEncoder).
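Post-LN (original paper, default norm_first=False):
$$x \leftarrow \mathrm{LayerNorm}\bigl(x + \mathrm{Dropout}(\mathrm{SelfAttn}(x))\bigr)$$
$$x \leftarrow \mathrm{LayerNorm}\bigl(x + \mathrm{Dropout}(\mathrm{FFN}(x))\bigr)$$
Normalising after the residual addition places layer normalisation on the residual path itself, so gradients must pass through every normalisation and there is no pure identity path through the layer. This can cause instability at the start of training for very deep models.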
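Pre-LN (norm_first=True):
$$x \leftarrow x + \mathrm{Dropout}(\mathrm{SelfAttn}(\mathrm{LayerNorm}(x)))$$
$$x \leftarrow x + \mathrm{Dropout}(\mathrm{FFN}(\mathrm{LayerNorm}(x)))$$
Normalising only the sub-layer input leaves the residual stream on an identity path, which substantially improves gradient flow and often allows training with little or no learning-rate warm-up. Pre-LN is the default in most modern large-scale transformers (GPT-2, GPT-3, LLaMA, etc.).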
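The difference between the two orderings can be sketched in plain Python. This is an illustrative toy, not Lucid's implementation: `layer_norm` has no learned scale or shift, `toy_sublayer` is a hypothetical stand-in for self-attention or the FFN, and dropout is omitted.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalise a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for self-attention or the feed-forward network."""
    return [0.5 * v + 1.0 for v in x]

def post_ln_step(x, sublayer):
    # Post-LN: add the residual first, then normalise the sum.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln_step(x, sublayer):
    # Pre-LN: normalise only the sub-layer input; x itself is added unchanged.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

x = [1.0, 2.0, 4.0, 8.0]
post = post_ln_step(x, toy_sublayer)
pre = pre_ln_step(x, toy_sublayer)
```

In this sketch the Post-LN output is always re-normalised, whereas the Pre-LN output is the untouched input plus the sub-layer branch, which is the identity path referred to above.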
Feed-forward network (FFN):
$$\mathrm{FFN}(x) = W_2\,\sigma(W_1 x + b_1) + b_2$$
where $\sigma$ is either ReLU or GELU depending on
activation. The inner dimension dim_feedforward is
typically set to $4 \times d_{\text{model}}$, as in the
original paper.
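The feed-forward computation can be sketched for a single position vector using only the standard library. The helper names (`linear`, `ffn`) and the toy weights are hypothetical, and the GELU shown is the exact erf form; whether Lucid uses this form or the tanh approximation is not stated here.

```python
import math

def relu(v):
    return max(0.0, v)

def gelu(v):
    # Exact GELU: v * Phi(v), with Phi the standard normal CDF.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def linear(x, weight, bias):
    """Compute weight @ x + bias for one vector x; weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def ffn(x, w1, b1, w2, b2, activation="relu"):
    act = {"relu": relu, "gelu": gelu}[activation]
    hidden = [act(h) for h in linear(x, w1, b1)]  # d_model -> dim_feedforward
    return linear(hidden, w2, b2)                 # dim_feedforward -> d_model

# Toy sizes: d_model = 2, dim_feedforward = 4 * d_model = 8.
d_model, dim_feedforward = 2, 8
w1 = [[0.1 * (i - j) for j in range(d_model)] for i in range(dim_feedforward)]
b1 = [0.0] * dim_feedforward
w2 = [[0.05 * (i + j) for j in range(dim_feedforward)] for i in range(d_model)]
b2 = [0.0] * d_model

y = ffn([1.0, -1.0], w1, b1, w2, b2, activation="gelu")
```

The output `y` has length `d_model` again, so the block can sit inside a residual connection without any reshaping.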
Parameters
- d_model (int)
- nhead (int): must divide d_model evenly.
- dim_feedforward (int, default 2048)
- dropout (float, default 0.1)
- activation (str, default 'relu'): supported values are "relu" (default) and "gelu".
- batch_first (bool, default False): if True, inputs and outputs are (batch, seq, feature). If False (default), they are (seq, batch, feature).
- norm_first (bool, default False): if True, applies Pre-LN (layer norm before each sub-layer). If False (default), applies Post-LN (layer norm after the residual addition).
- device (DeviceLike, default None)
- dtype (DTypeLike, default None)
Attributes
- self_attn (MultiheadAttention)
- linear1 (Linear): d_model → dim_feedforward.
- linear2 (Linear): dim_feedforward → d_model.
- norm1 (LayerNorm)
- norm2 (LayerNorm)
- dropout1 (Dropout)
- dropout2 (Dropout)
- dropout3 (Dropout)
Notes
- Input src: $(S, N, E)$ when batch_first=False, or $(N, S, E)$ when batch_first=True.
- Output: same shape as src.
where $S$ is the source sequence length, $N$ is the
batch size, and $E$ = d_model.
The three Dropout modules (dropout1, dropout2,
dropout3) are each initialised with the same probability but
are distinct instances. This ensures that each sub-layer's dropout
mask is sampled independently, giving the model more regularisation
diversity.
For inference, call model.eval() to disable all three dropout
layers simultaneously through Lucid's Module.training flag.
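The train/eval behaviour described above can be illustrated with a toy dropout class. `ToyDropout` is hypothetical and assumes the common inverted-dropout scheme (survivors rescaled by 1 / (1 - p)); it is a sketch of the idea, not Lucid's Dropout.

```python
import random

class ToyDropout:
    """Hypothetical inverted-dropout module with a `training` flag."""

    def __init__(self, p, seed=None):
        self.p = p
        self.training = True
        self._rng = random.Random(seed)

    def __call__(self, x):
        if not self.training or self.p == 0.0:
            return list(x)  # identity when not training
        keep = 1.0 - self.p
        # Zero each element with probability p and rescale the survivors.
        return [v / keep if self._rng.random() < keep else 0.0 for v in x]

# Three distinct instances with the same probability, as in the layer.
drops = [ToyDropout(0.1, seed=s) for s in range(3)]
x = [1.0] * 8

train_out = [d(x) for d in drops]  # each instance draws its own mask

for d in drops:                    # analogue of eval() on the parent module
    d.training = False
eval_out = [d(x) for d in drops]   # all three now pass x through unchanged
```

Flipping the flag on every instance mirrors what a single eval() call on the enclosing module is described as doing.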
Examples
**Basic Post-LN encoder layer** (default):
>>> import lucid
>>> import lucid.nn as nn
>>> layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
>>> # Sequence-first layout: (seq_len, batch, d_model)
>>> src = lucid.randn(20, 4, 512)
>>> out = layer(src)
>>> out.shape
(20, 4, 512)
**Pre-LN variant with GELU and batch_first layout**:
>>> layer = nn.TransformerEncoderLayer(
...     d_model=256,
...     nhead=4,
...     dim_feedforward=1024,
...     dropout=0.0,
...     activation="gelu",
...     batch_first=True,
...     norm_first=True,
... )
>>> src = lucid.randn(2, 15, 256)  # (batch, seq, d_model)
>>> out = layer(src)
>>> out.shape
(2, 15, 256)
Methods (3)
__init__
__init__(d_model: int, nhead: int, dim_feedforward: int = 2048, dropout: float = 0.1, activation: str = 'relu', batch_first: bool = False, norm_first: bool = False, device: DeviceLike = None, dtype: DTypeLike = None) → None
Initialise the TransformerEncoderLayer module. See the class docstring for parameter semantics.
forward
forward(src: Tensor, src_mask: Tensor | None = None, src_key_padding_mask: Tensor | None = None, is_causal: bool = False) → Tensor
Run the forward pass of the module.
Parameters
- src (Tensor)
- src_mask (Tensor | None, default None)
- src_key_padding_mask (Tensor | None, default None)
- is_causal (bool, default False)
Returns
- Tensor: output tensor; refer to the class docstring for the exact shape.
extra_repr
extra_repr() → str
Return a string representation of the layer's configuration.