fn

layer_norm

layer_norm(x: Tensor, normalized_shape: list[int] | tuple[int, ...], weight: Tensor | None = None, bias: Tensor | None = None, eps: float = 1e-05) -> Tensor

Layer normalization (Ba, Kiros & Hinton, 2016).

Normalises each sample independently over its last len(normalized_shape) dimensions, whose sizes must match normalized_shape. Unlike batch_norm, no batch statistics are involved, which makes LayerNorm the default choice for transformers and other models where batches may be small or sequence lengths variable.

Parameters

x : Tensor
Input whose trailing dims match normalized_shape.
normalized_shape : list of int or tuple of int
Trailing dims to normalise over. E.g. (d,) for a token-wise normalisation in a transformer with hidden size d.
weight : Tensor, default None
Per-element scale $\gamma$ of shape normalized_shape. Defaults to ones.
bias : Tensor, default None
Per-element shift $\beta$ of shape normalized_shape. Defaults to zeros.
eps : float, default 1e-05
Small constant added to the variance inside the square root for numerical stability.
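
To make the role of normalized_shape concrete, here is a minimal NumPy sketch of how the reduction axes could be derived from it. The helper `reduction_axes` is hypothetical and written for illustration only; it is not part of lucid, and the library's own validation may differ.

```python
import numpy as np


def reduction_axes(x_shape, normalized_shape):
    """Hypothetical helper: axes a layer norm would reduce over."""
    normalized_shape = tuple(normalized_shape)
    k = len(normalized_shape)
    if k == 0 or k > len(x_shape):
        raise ValueError("normalized_shape must have between 1 and x.ndim entries")
    if tuple(x_shape[-k:]) != normalized_shape:
        raise ValueError("trailing dims of x must match normalized_shape")
    # The last k axes, written as negative indices.
    return tuple(range(-k, 0))


x = np.zeros((2, 10, 64))

# Token-wise normalisation: one set of statistics per (sample, token).
token_axes = reduction_axes(x.shape, (64,))
per_token_mean = x.mean(axis=token_axes, keepdims=True)

# Normalising over the last two dims: one set of statistics per sample.
joint_axes = reduction_axes(x.shape, (10, 64))
per_sample_mean = x.mean(axis=joint_axes, keepdims=True)

print(token_axes, per_token_mean.shape)
print(joint_axes, per_sample_mean.shape)
```

With `keepdims=True` the statistics keep singleton axes, so they broadcast back against x, and a weight or bias of shape normalized_shape broadcasts against the trailing dims in the same way.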

Returns

Tensor

Same shape as x.

Notes

Math (reduction taken over the last $k$ axes, $k = \mathrm{len}(\text{normalized\_shape})$; $S$ is the set of index positions along those axes):

$$
\begin{aligned}
\mu &= \frac{1}{|S|}\sum_{j \in S} x_j \\
\sigma^2 &= \frac{1}{|S|}\sum_{j \in S} (x_j - \mu)^2 \\
y &= \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta
\end{aligned}
$$
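
The formulas above can be sketched directly in NumPy. This is a reference sketch under stated assumptions, not lucid's implementation: `layer_norm_ref` is a hypothetical name, it operates on NumPy arrays rather than lucid tensors, and it treats a missing weight or bias as "skip the step", which is numerically the same as ones and zeros.

```python
import numpy as np


def layer_norm_ref(x, normalized_shape, weight=None, bias=None, eps=1e-5):
    """Hypothetical NumPy reference for the layer norm formulas."""
    k = len(normalized_shape)
    axes = tuple(range(-k, 0))  # the last k axes

    mu = x.mean(axis=axes, keepdims=True)
    # Biased variance (divide by |S|), matching the definition above.
    var = ((x - mu) ** 2).mean(axis=axes, keepdims=True)

    y = (x - mu) / np.sqrt(var + eps)
    if weight is not None:
        y = y * weight  # gamma, broadcast over the leading dims
    if bias is not None:
        y = y + bias    # beta, broadcast over the leading dims
    return y


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 10, 64))

y = layer_norm_ref(x, (64,))

# With scale and shift: every normalised value is mapped to 2 * y + 0.5.
gamma = np.full(64, 2.0)
beta = np.full(64, 0.5)
y_affine = layer_norm_ref(x, (64,), weight=gamma, bias=beta)
```

Without weight and bias, each normalised slice has mean close to 0 and variance slightly below 1, because eps is added before the square root.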

Because the reduction is per-sample, behaviour is identical at train and eval time, and no running statistics are needed. This is why LayerNorm is so prevalent in sequence models (RNNs, transformers).
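
The per-sample property can be checked numerically. The sketch below uses NumPy and a hypothetical helper `_ln` (normalisation over the last axis only, no scale or shift) rather than lucid itself. It compares normalising a sample inside a batch with normalising it alone, and contrasts that with a statistic taken across the batch axis.

```python
import numpy as np


def _ln(x, eps=1e-5):
    """Hypothetical helper: layer norm over the last axis, no scale or shift."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)  # NumPy's default is the biased variance
    return (x - mu) / np.sqrt(var + eps)


rng = np.random.default_rng(0)
batch = rng.standard_normal((4, 10, 64))

full = _ln(batch)       # normalise the whole batch at once
alone = _ln(batch[:1])  # normalise the first sample on its own

# The first sample's output does not depend on the rest of the batch.
same = np.allclose(full[:1], alone)

# A batch-style statistic (mean over axis 0) does change with batch contents.
mean_all = batch.mean(axis=0)
mean_half = batch[:2].mean(axis=0)
batch_stat_changes = not np.allclose(mean_all, mean_half)

print(same, batch_stat_changes)
```

This independence is why no running averages need to be tracked: the statistics can always be recomputed from the input alone.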

Examples

>>> import lucid
>>> from lucid.nn.functional import layer_norm
>>> x = lucid.randn(2, 10, 64)
>>> y = layer_norm(x, normalized_shape=(64,))
>>> y.shape
(2, 10, 64)