class LayerNorm extends Module
LayerNorm(normalized_shape: int | list[int] | tuple[int, ...], eps: float = 1e-05, elementwise_affine: bool = True, bias: bool = True, device: DeviceLike = None, dtype: DTypeLike = None)
Layer normalization over the trailing dimensions of the input.
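Normalizes each sample independently by computing the mean and variance
over the axes defined by normalized_shape:
$$y = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta$$
where $\mu$ and $\sigma^2$ are the mean and variance computed over the last
len(normalized_shape) dimensions of the input tensor, $\epsilon$ is eps, and $\gamma$ and $\beta$ are the learnable scale (weight) and shift (bias).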
Unlike batch normalization, layer normalization does not depend on the batch dimension, which makes it well suited to variable-length sequences, transformer architectures, and settings where the batch size may be as small as 1.
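As a rough illustration of the computation described above, the following dependency-free Python sketch normalizes the last axis of a nested-list (N, T, C) input. The helper names normalize_row and layer_norm_last_axis are hypothetical and not part of lucid; the sketch leaves out the affine parameters and makes no claim about how the library implements the layer.

```python
import math


def normalize_row(row, eps=1e-5):
    """Normalize one 1-D list of floats to roughly zero mean and unit variance."""
    n = len(row)
    mean = sum(row) / n
    # Biased (population) variance: divide by n, not n - 1.
    var = sum((v - mean) ** 2 for v in row) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in row]


def layer_norm_last_axis(x, eps=1e-5):
    """Normalize every (n, t) position of a nested (N, T, C) list independently."""
    return [[normalize_row(row, eps) for row in sample] for sample in x]


# Hypothetical (N=2, T=2, C=3) input.
batch = [
    [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]],
    [[-4.0, 0.0, 4.0], [5.0, 5.0, 8.0]],
]
out = layer_norm_last_axis(batch)

# Statistics never cross the batch dimension, so normalizing the first
# sample on its own (batch size 1) gives the same values as in the batch.
alone = layer_norm_last_axis([batch[0]])
print(out[0] == alone[0])  # True
```

Because each row is handled in isolation, the result for a sample does not change with the batch size or with the contents of other samples.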
Parameters
normalized_shape: int or list[int] or tuple[int, ...]
    Shape of the trailing dimensions to normalize over. If an integer d is given, it is treated as (d,), normalizing only the last axis. For an (N, T, C) input with normalized_shape=(C,), the mean and variance are computed independently for each (n, t) position.
eps: float = 1e-05
    Small constant added to the denominator for numerical stability. Default: 1e-5.
elementwise_affine: bool = True
    If True, learns a per-element scale and (optionally) shift of shape normalized_shape. If False, no affine parameters are created and the output is purely normalized. Default: True.
bias: bool = True
    Only meaningful when elementwise_affine=True. If False, the module learns only a scale with no additive shift. Default: True.
device: DeviceLike = None
    Device on which to allocate the learnable parameters. Default: None (uses the default device).
dtype: DTypeLike = None
    Data type of the learnable parameters. Default: None (inherits from the input).
Attributes
weight: Parameter or None
    Learnable per-element scale of shape normalized_shape. None when elementwise_affine=False.
bias: Parameter or None
    Learnable per-element shift of shape normalized_shape. None when elementwise_affine=False or bias=False.
Notes
- Input: shape (*, *normalized_shape), that is, any leading batch dimensions followed by the normalized trailing dimensions.
- Output: same shape as the input.
- The mean and variance are computed with correction=0 (biased estimator), consistent with the standard layer-norm convention.
- Weights are initialized to 1 and biases to 0, so the affine step is the identity at the start of training.
- With elementwise_affine=True, bias=False, the module matches the "scale-only" layer norm used in some modern architectures.
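The notes above can be made concrete with a small standard-library sketch. It contrasts the biased (correction=0) and unbiased variance estimators, then shows that a weight of ones and a bias of zeros leaves the normalized values unchanged. The function affine_layer_norm and its weight/bias arguments are hypothetical stand-ins for illustration, not lucid's API.

```python
import math
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# correction=0 divides by n (biased / population variance) ...
biased = statistics.pvariance(values)    # 4.0
# ... whereas correction=1 would divide by n - 1 (sample variance).
unbiased = statistics.variance(values)   # 32 / 7, slightly larger


def affine_layer_norm(row, weight=None, bias=None, eps=1e-5):
    """Normalize `row` with the biased variance, then apply optional scale and shift."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    normed = [(v - mean) / math.sqrt(var + eps) for v in row]
    if weight is not None:   # mirrors elementwise_affine=True
        normed = [w * v for w, v in zip(weight, normed)]
    if bias is not None:     # mirrors bias=True
        normed = [v + b for v, b in zip(normed, bias)]
    return normed


plain = affine_layer_norm(values)
# Initial state described in the notes: weight all ones, bias all zeros.
at_init = affine_layer_norm(values, weight=[1.0] * 8, bias=[0.0] * 8)
# Scale-only variant: a weight is applied but there is no additive shift.
scale_only = affine_layer_norm(values, weight=[2.0] * 8)

print(at_init == plain)  # True
```

With the scale-only configuration the output is a pure rescaling of the normalized values, so its mean stays at zero regardless of the learned weight when that weight is uniform, as it is here.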
Examples
Normalize the last dimension of a sequence model's hidden states:
>>> import lucid
>>> import lucid.nn as nn
>>> ln = nn.LayerNorm(512)
>>> x = lucid.randn(32, 64, 512) # (batch, seq_len, hidden_dim)
>>> out = ln(x)
>>> out.shape
(32, 64, 512)
Normalize over multiple trailing dimensions (e.g. height and width):
>>> ln2d = nn.LayerNorm((28, 28))
>>> x2d = lucid.randn(8, 1, 28, 28)
>>> out2d = ln2d(x2d)
>>> out2d.shape
(8, 1, 28, 28)
Methods (3)
__init__(normalized_shape: int | list[int] | tuple[int, ...], eps: float = 1e-05, elementwise_affine: bool = True, bias: bool = True, device: DeviceLike = None, dtype: DTypeLike = None) → None
Initialize the LayerNorm module. See the class docstring for parameter semantics.
forward(x: Tensor) → Tensor
Apply normalization to the input tensor.
Parameters
x: Tensor
    Input tensor whose shape is documented in the class docstring.
Returns
Tensor
    Normalized tensor of the same shape as x.
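When normalized_shape has more than one entry, one way to picture the forward pass is as a reshape: in row-major layout the trailing dimensions are contiguous, so every block of prod(normalized_shape) elements shares one mean and variance. The sketch below works on a flat Python list under that assumption. The name layer_norm_flat is hypothetical, and the shape check it performs is an assumption of this sketch rather than documented lucid behavior.

```python
import math


def layer_norm_flat(data, shape, normalized_shape, eps=1e-5):
    """Layer-normalize a row-major flat list over its trailing dimensions.

    `data` holds math.prod(shape) floats, and `normalized_shape` must equal
    the last len(normalized_shape) entries of `shape`.
    """
    k = len(normalized_shape)
    # Assumed validation: the input must end with the normalized dimensions.
    if tuple(shape[-k:]) != tuple(normalized_shape):
        raise ValueError("trailing dimensions must match normalized_shape")
    group = math.prod(normalized_shape)  # elements sharing one mean/variance
    out = []
    for start in range(0, len(data), group):
        chunk = data[start:start + group]
        mean = sum(chunk) / group
        var = sum((v - mean) ** 2 for v in chunk) / group  # biased variance
        inv_std = 1.0 / math.sqrt(var + eps)
        out.extend((v - mean) * inv_std for v in chunk)
    return out


# A (2, 1, 2, 3) input normalized over its last two dimensions, (2, 3):
# two independent groups of six elements each.
shape = (2, 1, 2, 3)
data = [float(i * i) for i in range(math.prod(shape))]
out = layer_norm_flat(data, shape, normalized_shape=(2, 3))
```

The output has the same number of elements as the input, in the same order, which matches the "same shape as the input" contract stated for forward.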
extra_repr() → str
Return a string representation of the layer's configuration.