layer_norm

layer_norm(x: Tensor, normalized_shape: list[int] | tuple[int, ...], weight: Tensor | None = None, bias: Tensor | None = None, eps: float = 1e-05) → Tensor

Layer normalization (Ba, Kiros & Hinton, 2016).

Normalises each sample independently over its last len(normalized_shape) dimensions. Unlike batch_norm, it uses no batch statistics, which makes LayerNorm the default choice for transformers and other models where batches may be small or sequence lengths variable.

Parameters

x : Tensor

Input whose trailing dims match normalized_shape.

normalized_shape : list of int or tuple of int

Trailing dims to normalise over. E.g. (d,) for a token-wise normalisation in a transformer with hidden size d.

weight : Tensor = None

Per-element scale $\gamma$ of shape normalized_shape. Defaults to ones.

bias : Tensor = None

Per-element shift $\beta$ of shape normalized_shape. Defaults to zeros.

eps : float = 1e-05

Small constant added to the variance inside the square root for numerical stability.

Returns

Tensor

Same shape as x.

Notes

Math (the reduction is taken over the last $k$ axes, $k = \mathrm{len}(\text{normalized\_shape})$; $S$ is the set of element indices spanned by those axes within one sample):

$$
\begin{aligned}
\mu &= \frac{1}{|S|}\sum_{j \in S} x_j \\
\sigma^2 &= \frac{1}{|S|}\sum_{j \in S} (x_j - \mu)^2 \\
y_j &= \gamma_j \cdot \frac{x_j - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_j
\end{aligned}
$$
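
The formulas above can be sketched in plain NumPy to make the reduction axes and the biased variance explicit. This is a minimal reference under stated assumptions, not lucid's actual kernel: the function name `layer_norm_reference` is hypothetical, and NumPy arrays stand in for lucid tensors.

```python
import numpy as np


def layer_norm_reference(x, normalized_shape, weight=None, bias=None, eps=1e-5):
    """Hypothetical NumPy reference for the formulas above (not lucid's kernel)."""
    k = len(normalized_shape)
    if k == 0 or tuple(x.shape[-k:]) != tuple(normalized_shape):
        raise ValueError("trailing dims of x must match normalized_shape")

    # Reduce over the last k axes; every leading index is its own sample.
    axes = tuple(range(x.ndim - k, x.ndim))
    mu = x.mean(axis=axes, keepdims=True)
    # Biased variance: divide by |S|, not |S| - 1.
    var = ((x - mu) ** 2).mean(axis=axes, keepdims=True)

    y = (x - mu) / np.sqrt(var + eps)
    if weight is not None:  # gamma, shape normalized_shape (defaults to ones)
        y = y * weight
    if bias is not None:    # beta, shape normalized_shape (defaults to zeros)
        y = y + bias
    return y


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 10, 64))
y = layer_norm_reference(x, (64,))             # token-wise: reduces axis 2
y_2d = layer_norm_reference(x, (10, 64))       # reduces axes 1 and 2 together
```

With `normalized_shape=(64,)` each of the 2 × 10 token vectors is normalised separately; with `(10, 64)` each of the 2 samples is normalised over all 640 of its elements.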

Because the reduction is per-sample, behaviour is identical at train and eval time, and no running statistics are needed. This is one reason LayerNorm is so prevalent in sequence models (RNNs, transformers).
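
The per-sample property can be checked directly: normalising one sample on its own should give the same result as normalising it as part of a larger batch. The sketch below uses NumPy rather than lucid, and the helper `normalize_last_axis` is a hypothetical stand-in for layer_norm without the affine parameters.

```python
import numpy as np


def normalize_last_axis(x, eps=1e-5):
    """Hypothetical helper: layer-norm core over the last axis, no gamma/beta."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)  # NumPy's default is the biased variance
    return (x - mu) / np.sqrt(var + eps)


rng = np.random.default_rng(0)
batch = rng.standard_normal((4, 10, 64))

in_batch = normalize_last_axis(batch)[:1]  # sample 0, normalised with its batch
alone = normalize_last_axis(batch[:1])     # sample 0, normalised by itself

# The other three samples have no influence on sample 0's output.
batch_independent = np.allclose(in_batch, alone)
```

A batch-statistics normaliser would fail this check, because its mean and variance change whenever the other samples in the batch change.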

Examples

>>> import lucid
>>> from lucid.nn.functional import layer_norm
>>> x = lucid.randn(2, 10, 64)
>>> y = layer_norm(x, normalized_shape=(64,))
>>> y.shape
(2, 10, 64)
