sinusoidal_embedding_2d

sinusoidal_embedding_2d(height: int, width: int, embedding_dim: int, base: float = 10000.0, device: str = 'cpu') → Tensor

Build the 2-D sinusoidal positional encoding from DETR (Carion et al., 2020).

Extends the 1-D sinusoidal encoding to spatial feature maps by concatenating two independent encodings — one for the column index and one for the row index — each occupying half of the embedding dimension. This gives a position-unique vector for every grid cell without learnable parameters, and is the encoding used by DETR (§A.4), DiT, and other 2-D image transformers.

Parameters

height : int

Feature-map height $H$.

width : int

Feature-map width $W$.

embedding_dim : int

Per-position embedding size $d$. Must be divisible by 4: each axis contributes $d/2$ dimensions of paired sin / cos values, so the half itself must be even.

base : float = 10000.0

Frequency base. DETR uses 10_000.

device : str = 'cpu'

Target device ("cpu" or "metal").

Returns

Tensor

(height * width, embedding_dim) float tensor, ordered row-major (outer loop r ∈ [0, H), inner loop c ∈ [0, W)).
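Row-major ordering means grid cell (r, c) lands at flat index r * W + c. The NumPy sketch below illustrates that mapping on a stand-in array rather than on the function's actual output; the names `flat` and `grid` are illustrative only.

```python
import numpy as np

H, W, d = 3, 4, 8

# Stand-in for the returned (H * W, d) tensor: row k is filled with the value k.
flat = np.repeat(np.arange(H * W)[:, None], d, axis=1)

# Row-major: grid cell (r, c) lives at flat index r * W + c.
r, c = 2, 1
k = r * W + c

# Reshaping to (H, W, d) gives the 2-D grid view of the same data.
grid = flat.reshape(H, W, d)
assert (grid[r, c] == flat[k]).all()

# divmod inverts the mapping: flat index -> (row, column).
assert divmod(k, W) == (r, c)
```

The same reshape should apply to any array with this layout, which is convenient when the encoding needs to be inspected or plotted per grid cell.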

Raises

ValueError

If embedding_dim is not divisible by 4.
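The divisibility rule can be sketched as a small check. `check_embedding_dim` is a hypothetical helper written for illustration, not part of the library, and the error message wording is an assumption.

```python
def check_embedding_dim(embedding_dim: int) -> int:
    """Hypothetical helper: return the number of sin/cos pairs per axis."""
    if embedding_dim % 4 != 0:
        raise ValueError(
            f"embedding_dim must be divisible by 4, got {embedding_dim}"
        )
    per_axis = embedding_dim // 2  # dimensions assigned to each axis
    return per_axis // 2           # sin/cos pairs (frequencies) per axis


print(check_embedding_dim(128))  # 128 -> 64 per axis -> 32 pairs

try:
    check_embedding_dim(6)  # per-axis half would be 3, which cannot be paired
except ValueError as exc:
    print(exc)
```

An even `embedding_dim` is not enough: 6 splits into two halves of 3, and an odd half leaves one channel without a sin / cos partner.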

Notes

Layout per position $(r, c)$:

$$\begin{aligned} \text{out}[r \cdot W + c, \; :d/2] &= \text{PE}_\text{col}(c) \\ \text{out}[r \cdot W + c, \; d/2:] &= \text{PE}_\text{row}(r) \end{aligned}$$

where each axis table is the standard 1-D encoding at dimension $d/2$. The returned tensor is already flattened to a sequence of length $H \cdot W$, so it can be added directly to the flattened image features before the first transformer block.
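The layout above can be reproduced with a minimal NumPy sketch. This is not the library's implementation: `pe_1d` and `sinusoidal_2d_sketch` are illustrative names, and the channel order inside each axis table (sin on even channels, cos on odd channels) is an assumption that may differ from the actual function.

```python
import numpy as np


def pe_1d(n: int, dim: int, base: float = 10000.0) -> np.ndarray:
    """1-D sinusoidal table of shape (n, dim); dim must be even.

    Assumption: sin on even channels, cos on odd channels (interleaved).
    """
    pos = np.arange(n, dtype=np.float64)[:, None]       # (n, 1)
    i = np.arange(dim // 2, dtype=np.float64)[None, :]  # (1, dim/2)
    angles = pos / base ** (2.0 * i / dim)              # (n, dim/2)
    table = np.empty((n, dim))
    table[:, 0::2] = np.sin(angles)
    table[:, 1::2] = np.cos(angles)
    return table


def sinusoidal_2d_sketch(height: int, width: int, embedding_dim: int,
                         base: float = 10000.0) -> np.ndarray:
    """Concatenate column and row tables into an (H * W, d) array."""
    if embedding_dim % 4 != 0:
        raise ValueError("embedding_dim must be divisible by 4")
    half = embedding_dim // 2
    col = pe_1d(width, half, base)   # (W, d/2), indexed by c
    row = pe_1d(height, half, base)  # (H, d/2), indexed by r
    # Row-major flattening: flat index k = r * W + c.
    col_part = np.tile(col, (height, 1))      # k -> col[k % W]
    row_part = np.repeat(row, width, axis=0)  # k -> row[k // W]
    return np.concatenate([col_part, row_part], axis=1)


pe = sinusoidal_2d_sketch(height=16, width=16, embedding_dim=128)
print(pe.shape)
```

Under these assumptions, `pe[r * 16 + c, :64]` equals the column table at `c` and `pe[r * 16 + c, 64:]` equals the row table at `r`, so an array of flattened features with shape `(256, 128)` could be combined with it by plain addition.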

Examples

>>> import lucid
>>> from lucid.nn.functional import sinusoidal_embedding_2d
>>> pe = sinusoidal_embedding_2d(height=16, width=16, embedding_dim=128)
>>> pe.shape
(256, 128)
