class

LayerNorm

extends Module
LayerNorm(normalized_shape: int | list[int] | tuple[int, ...], eps: float = 1e-05, elementwise_affine: bool = True, bias: bool = True, device: DeviceLike = None, dtype: DTypeLike = None)

Layer normalization over the trailing dimensions of the input.

Normalises each sample independently by computing mean and variance over the axes defined by normalized_shape:

$$y = \frac{x - \mu}{\sqrt{\sigma^2 + \varepsilon}} \cdot \gamma + \beta$$

where $\mu$ and $\sigma^2$ are computed over the last len(normalized_shape) dimensions of the input tensor.
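As a quick check of the formula, here is a minimal pure-Python sketch for a single vector. It is not the library's implementation: the helper name `layer_norm_1d` and its arguments are illustrative assumptions, and γ and β are taken as scalars for brevity.

```python
import math

def layer_norm_1d(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Illustrative helper (not part of lucid): normalize one vector."""
    n = len(x)
    mu = sum(x) / n                             # mean over the normalized axis
    var = sum((v - mu) ** 2 for v in x) / n     # biased variance (divide by n)
    inv_std = 1.0 / math.sqrt(var + eps)        # eps sits inside the square root
    return [(v - mu) * inv_std * gamma + beta for v in x]

y = layer_norm_1d([1.0, 2.0, 3.0, 4.0])
mean_y = sum(y) / len(y)
var_y = sum((v - mean_y) ** 2 for v in y) / len(y)
```

With γ = 1 and β = 0 the output has a mean of approximately 0 and a variance slightly below 1, because ε is added to the variance before the square root.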

Unlike batch normalization, layer normalization does not depend on the batch dimension, making it well-suited to variable-length sequences, transformer architectures, and settings where the batch size may be as small as 1.
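The batch independence can be illustrated with a small NumPy sketch. The helper `normalize_last_axis` is a hypothetical stand-in, not lucid's API, and it omits the affine terms: normalizing a sample on its own gives the same result as normalizing it as part of a larger batch.

```python
import numpy as np

def normalize_last_axis(x, eps=1e-5):
    """Illustrative helper (not lucid's API): layer norm without affine terms."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)   # ddof=0, i.e. the biased estimator
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch = rng.standard_normal((4, 8))       # (batch, features)

full = normalize_last_axis(batch)         # normalize the whole batch at once
single = normalize_last_axis(batch[:1])   # normalize the first sample alone

# Statistics are per sample, so the two results agree.
same = np.allclose(full[:1], single)
```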

Parameters

normalized_shape : int or list[int] or tuple[int, ...]
Shape of the trailing dimensions to normalize over. If an integer d is given it is treated as (d,), normalizing only the last axis. For a (N, T, C) input with normalized_shape=(C,) the mean and variance are computed independently for each (n, t) position.
eps : float = 1e-05
Small constant added to the variance inside the square root for numerical stability. Default: 1e-5.
elementwise_affine : bool = True
If True, learns per-element scale $\gamma$ and (optionally) shift $\beta$ of shape normalized_shape. If False, no affine parameters are created and the output is purely normalized. Default: True.
bias : bool = True
Only meaningful when elementwise_affine=True. If False, the module learns only a scale $\gamma$ with no additive shift. Default: True.
device : DeviceLike = None
Device on which to allocate the learnable parameters. Default: None (uses the default device).
dtype : DTypeLike = None
Data type of the learnable parameters. Default: None (inherits from the input).
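To make the role of normalized_shape concrete, the sketch below maps it to the reduced axes and to the shape of the per-position statistics. The helper `normalized_axes` is hypothetical and not part of lucid; the shape check it performs is an assumption about reasonable behaviour, not a documented guarantee.

```python
def normalized_axes(input_shape, normalized_shape):
    """Illustrative helper (not part of lucid): map normalized_shape to axes."""
    if isinstance(normalized_shape, int):
        normalized_shape = (normalized_shape,)   # an int d is treated as (d,)
    normalized_shape = tuple(normalized_shape)
    k = len(normalized_shape)
    if k > len(input_shape) or tuple(input_shape[-k:]) != normalized_shape:
        raise ValueError(
            f"trailing dims of {tuple(input_shape)} do not match {normalized_shape}"
        )
    axes = tuple(range(-k, 0))                   # the last k axes are reduced
    stats_shape = tuple(input_shape[:-k])        # one (mean, variance) pair per entry
    return axes, stats_shape

axes_1d, stats_1d = normalized_axes((32, 64, 512), 512)
axes_2d, stats_2d = normalized_axes((8, 1, 28, 28), (28, 28))
```

For the (N, T, C) case the statistics have shape (N, T), one mean and variance per (n, t) position.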

Attributes

weight : Parameter or None
Learnable per-element scale $\gamma$ of shape normalized_shape. None when elementwise_affine=False.
bias : Parameter or None
Learnable per-element shift $\beta$ of shape normalized_shape. None when elementwise_affine=False or bias=False.

Notes

  • Input: $(*, \text{normalized\_shape})$, i.e. any leading batch dimensions followed by the normalized trailing dimensions.
  • Output: same shape as the input.
  • The mean and variance are computed with correction=0 (biased estimator), consistent with the standard layer-norm convention.
  • Weights are initialised to 1 and biases to 0 so that the affine step is an identity at the start of training.
  • With elementwise_affine=True and bias=False, the module matches the "scale-only" layer norm used in some modern architectures.
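Two of the notes above can be checked with plain Python. The sketch uses the standard-library `statistics` module rather than lucid, and the lists standing in for activations and parameters are made-up illustrative values.

```python
import statistics

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

biased = statistics.pvariance(x)     # divides by n     (correction=0)
unbiased = statistics.variance(x)    # divides by n - 1 (correction=1)

# Affine step at initialization: weight = 1 and bias = 0 leave values unchanged.
normalized = [-1.5, -0.5, 0.5, 1.5]  # stand-in for already-normalized activations
weight = [1.0] * len(normalized)
bias = [0.0] * len(normalized)
after_affine = [v * w + b for v, w, b in zip(normalized, weight, bias)]
```

Here `biased` is 4.0 and `unbiased` is 32/7, so the biased estimator gives the smaller value, and `after_affine` equals `normalized`.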

Examples

Normalize the last dimension of a sequence model's hidden states:
>>> import lucid
>>> import lucid.nn as nn
>>> ln = nn.LayerNorm(512)
>>> x = lucid.randn(32, 64, 512)   # (batch, seq_len, hidden_dim)
>>> out = ln(x)
>>> out.shape
(32, 64, 512)
Normalize over multiple trailing dimensions (e.g. height and width):
>>> ln2d = nn.LayerNorm((28, 28))
>>> x2d = lucid.randn(8, 1, 28, 28)
>>> out2d = ln2d(x2d)
>>> out2d.shape
(8, 1, 28, 28)

Methods (3)

dunder

__init__

None
__init__(normalized_shape: int | list[int] | tuple[int, ...], eps: float = 1e-05, elementwise_affine: bool = True, bias: bool = True, device: DeviceLike = None, dtype: DTypeLike = None)

Initialise the LayerNorm module. See the class docstring for parameter semantics.

fn

forward

Tensor
forward(x: Tensor)

Apply normalisation to the input tensor.

Parameters

x : Tensor
Input tensor whose shape is documented in the class docstring.

Returns

Tensor

Normalised tensor of the same shape as x.
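The following NumPy reference sketch shows what a forward pass of this kind computes when several trailing dimensions are normalized. It is an assumption-laden illustration, not lucid's actual forward: the function `layer_norm_forward` and the `gamma` and `beta` arrays are hypothetical.

```python
import numpy as np

def layer_norm_forward(x, normalized_shape, weight=None, bias=None, eps=1e-5):
    """NumPy reference sketch (hypothetical helper, not lucid's forward)."""
    if isinstance(normalized_shape, int):
        normalized_shape = (normalized_shape,)
    axes = tuple(range(-len(normalized_shape), 0))   # trailing axes to reduce
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)            # biased variance (ddof=0)
    y = (x - mu) / np.sqrt(var + eps)
    if weight is not None:
        y = y * weight                               # broadcasts over leading dims
    if bias is not None:
        y = y + bias
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 1, 28, 28))
gamma = np.full((28, 28), 2.0)    # stand-in for a learned scale
beta = np.full((28, 28), 0.5)     # stand-in for a learned shift

out = layer_norm_forward(x, (28, 28), weight=gamma, bias=beta)
```

The output keeps the input shape. With a constant scale of 2 and shift of 0.5, each (28, 28) slice ends up with a mean close to 0.5 and a standard deviation close to 2.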

fn

extra_repr

str
extra_repr()

Return a string representation of the layer's configuration.