sinusoidal_embedding_2d

sinusoidal_embedding_2d(height: int, width: int, embedding_dim: int, base: float = 10000.0, device: str = 'cpu') -> Tensor

Build the 2-D sinusoidal positional encoding from DETR (Carion et al., 2020).

Extends the 1-D sinusoidal encoding to spatial feature maps by concatenating two independent encodings — one for the column index and one for the row index — each occupying half of the embedding dimension. This gives a position-unique vector for every grid cell without learnable parameters, and is the encoding used by DETR (§A.4), DiT, and other 2-D image transformers.

Parameters

height: int
Feature-map height H.
width: int
Feature-map width W.
embedding_dim: int
Per-position embedding size d. Must be divisible by 4: each axis contributes d/2 dimensions of paired sin / cos values, so the half itself must be even.
base: float = 10000.0
Frequency base. DETR uses 10_000.
device: str = 'cpu'
Target device ("cpu" or "metal").

Returns

Tensor

(height * width, embedding_dim) float tensor, ordered row-major (outer loop r ∈ [0, H), inner loop c ∈ [0, W)).
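The row-major ordering above means a grid cell and its row in the returned table are related by simple index arithmetic. A minimal sketch of that mapping follows; `flat_index` and `grid_position` are hypothetical helper names used only for illustration and are not part of lucid.

```python
def flat_index(r, c, width):
    """Row of the output table that holds grid cell (r, c) under row-major order."""
    return r * width + c


def grid_position(index, width):
    """Inverse mapping: recover (r, c) from a row index of the output table."""
    return divmod(index, width)


# For a 16 x 16 feature map, cell (2, 5) sits at row 2 * 16 + 5 of the table.
idx = flat_index(2, 5, width=16)
cell = grid_position(idx, width=16)
```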

Raises

ValueError
If embedding_dim is not divisible by 4.
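To see why the requirement is divisibility by 4 rather than by 2, it may help to spell out the arithmetic. The sketch below is illustrative only: `axis_layout` is a hypothetical helper, and its error message is an assumption that need not match the text lucid raises.

```python
def axis_layout(embedding_dim):
    """Return (dims per axis, sin/cos pairs per axis) for a given embedding size."""
    if embedding_dim % 4 != 0:
        # Hypothetical message; the wording of the library's error is not documented here.
        raise ValueError("embedding_dim must be divisible by 4")
    half = embedding_dim // 2   # dimensions given to each axis (column, row)
    pairs = half // 2           # each sin/cos pair consumes two dimensions
    return half, pairs


layout = axis_layout(128)       # (64, 32): 64 dims per axis, 32 frequency pairs each

# embedding_dim=6 is even, but each axis would get 3 dimensions,
# which cannot be split into whole sin/cos pairs, so it is rejected.
```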

Notes

Layout per position (r, c):

$$
\begin{aligned}
\text{out}[r \cdot W + c, \; :d/2] &= \text{PE}_\text{col}(c) \\
\text{out}[r \cdot W + c, \; d/2:] &= \text{PE}_\text{row}(r)
\end{aligned}
$$

where each axis table is the standard 1-D encoding at dimension d/2. The result is already a flat sequence of length H · W, so it can be added directly to the flattened image features before the first transformer block.
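The layout described in these notes can be sketched in pure Python as a reference for checking shapes and ordering. This is not lucid's implementation: `pe_1d` and `sinusoidal_2d_reference` are hypothetical names, and the interleaved sin/cos ordering inside each axis table is an assumption (some implementations instead place all sines before all cosines).

```python
import math


def pe_1d(pos, dim, base=10000.0):
    """1-D sinusoidal encoding of one position, assuming interleaved sin/cos pairs."""
    out = []
    for i in range(dim // 2):
        freq = base ** (-2.0 * i / dim)   # 1 / base^(2i / dim)
        out.append(math.sin(pos * freq))
        out.append(math.cos(pos * freq))
    return out


def sinusoidal_2d_reference(height, width, embedding_dim, base=10000.0):
    """Nested-list table of shape (height * width, embedding_dim), row-major."""
    if embedding_dim % 4 != 0:
        raise ValueError("embedding_dim must be divisible by 4")
    half = embedding_dim // 2
    table = []
    for r in range(height):        # outer loop over rows
        for c in range(width):     # inner loop over columns
            # First half encodes the column index, second half the row index.
            table.append(pe_1d(c, half, base) + pe_1d(r, half, base))
    return table


table = sinusoidal_2d_reference(height=2, width=3, embedding_dim=8)
```

Under these assumptions, the entry for cell (r, c) is `table[r * 3 + c]`: its first four values depend only on c and its last four only on r.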

Examples

>>> import lucid
>>> from lucid.nn.functional import sinusoidal_embedding_2d
>>> pe = sinusoidal_embedding_2d(height=16, width=16, embedding_dim=128)
>>> pe.shape
(256, 128)