class MultiheadAttention extends Module

MultiheadAttention(embed_dim: int, num_heads: int, dropout: float = 0.0, bias: bool = True, add_bias_kv: bool = False, add_zero_attn: bool = False, kdim: int | None = None, vdim: int | None = None, num_kv_heads: int | None = None, batch_first: bool = False, device: DeviceLike = None, dtype: DTypeLike = None)


Multi-head scaled dot-product attention.

Implements the multi-head attention mechanism introduced in "Attention Is All You Need" (Vaswani et al., 2017). Each head independently computes scaled dot-product attention over a learned linear projection of the query, key and value inputs; the per-head outputs are then concatenated and projected once more to produce the final result.

Scaled dot-product attention for a single head:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

where $d_k$ is the per-head key dimension (head_dim). The $1/\sqrt{d_k}$ scaling prevents the dot-products from growing so large that the softmax function is pushed into regions with extremely small gradients.
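The single-head formula above can be sketched in a few lines of NumPy. This is only an illustration of the math, not Lucid's implementation; the helper names `softmax` and `scaled_dot_product_attention` are made up for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum first for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """q: (L, d_k), k: (S, d_k), v: (S, d_v) -> output (L, d_v), weights (L, S)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)      # (L, S) scaled similarity scores
    weights = softmax(scores, axis=-1)   # each row is a distribution over S
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))   # L=4 queries, d_k=8
k = rng.standard_normal((6, 8))   # S=6 keys
v = rng.standard_normal((6, 8))   # S=6 values, d_v=8
out, weights = scaled_dot_product_attention(q, k, v)
print(out.shape, weights.shape)
```

Each row of `weights` sums to 1, so every output row is a convex combination of the value rows.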

Multi-head attention uses $h$ parallel heads:

$$\text{head}_i = \text{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)$$

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O$$

where $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$, and $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$ are learned projection matrices.
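As a rough sketch of the multi-head equations, the per-head matrices $W_i^Q$, $W_i^K$, $W_i^V$ can be stored as column blocks of three full-width matrices, so that one matrix product followed by a reshape yields all heads at once. The function below is an unbatched NumPy illustration under that assumption (and with $d_k = d_v$ = head_dim); it is not Lucid's code and its names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def multi_head_attention(q, k, v, w_q, w_k, w_v, w_o, num_heads):
    """q: (L, E); k, v: (S, E); all weights: (E, E). Returns (L, E)."""
    L, E = q.shape
    S = k.shape[0]
    head_dim = E // num_heads
    # Project once, then split the feature axis into heads: (h, seq, head_dim).
    qh = (q @ w_q).reshape(L, num_heads, head_dim).transpose(1, 0, 2)
    kh = (k @ w_k).reshape(S, num_heads, head_dim).transpose(1, 0, 2)
    vh = (v @ w_v).reshape(S, num_heads, head_dim).transpose(1, 0, 2)
    # Scaled dot-product attention, independently for every head.
    scores = qh @ kh.transpose(0, 2, 1) / np.sqrt(head_dim)   # (h, L, S)
    heads = softmax(scores, axis=-1) @ vh                     # (h, L, head_dim)
    # Concat(head_1, ..., head_h), then the output projection W^O.
    concat = heads.transpose(1, 0, 2).reshape(L, E)
    return concat @ w_o

rng = np.random.default_rng(0)
E, h = 8, 2
q = rng.standard_normal((5, E))
k = rng.standard_normal((7, E))
v = rng.standard_normal((7, E))
w_q, w_k, w_v, w_o = (rng.standard_normal((E, E)) for _ in range(4))
out = multi_head_attention(q, k, v, w_q, w_k, w_v, w_o, num_heads=h)
print(out.shape)
```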

When kdim and vdim both equal embed_dim (and grouped-query attention is not in use; see num_kv_heads), the three input projections are stored as a single fused weight in_proj_weight of shape (3 * embed_dim, embed_dim) and split at runtime, which is more cache-friendly on Apple Silicon.
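The equivalence between a fused projection and three separate ones can be checked numerically. The sketch below assumes the fused weight stacks the Q, K and V blocks in that order along the first axis and that a linear layer computes `x @ W.T`; both are assumptions made for illustration, not details confirmed by this page.

```python
import numpy as np

E = 8
rng = np.random.default_rng(0)
in_proj_weight = rng.standard_normal((3 * E, E))   # assumed [W_q; W_k; W_v] stacking
x = rng.standard_normal((5, E))                    # 5 tokens of self-attention input

# One matrix product, then a cheap split of the result into three (5, E) pieces.
fused = x @ in_proj_weight.T                       # (5, 3E)
q, k, v = np.split(fused, 3, axis=-1)

# The same result from three separate projections.
w_q, w_k, w_v = np.split(in_proj_weight, 3, axis=0)
print(np.allclose(q, x @ w_q.T), np.allclose(k, x @ w_k.T), np.allclose(v, x @ w_v.T))
```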

Parameters

embed_dim: int

Total dimension of the model, $d_{\text{model}}$. Must be divisible by num_heads.

num_heads: int

Number of parallel attention heads, $h$. Each head operates on a subspace of dimension head_dim = embed_dim // num_heads.

dropout: float = 0.0

Dropout probability applied to the attention weight matrix during training. Default: 0.0.

bias: bool = True

If True, learnable bias terms are added to all input and output projection layers. Default: True.

add_bias_kv: bool = False

If True, learnable bias rows bias_k and bias_v (each of shape (1, 1, embed_dim)) are appended to the key and value sequences along the sequence dimension before the attention computation. Useful for cross-attention scenarios where extra context tokens are desired. Default: False.

add_zero_attn: bool = False

If True, a zero-valued row is appended to the key and value sequences. This can stabilise training in early steps by providing an "attend to nothing" option. Default: False.
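To make the effect of add_bias_kv and add_zero_attn on shapes concrete, here is a small NumPy sketch in the sequence-first layout. It only shows the concatenation along the sequence axis; where exactly Lucid performs it (before or after projection, per head or not) is not stated on this page, so treat the details as assumptions.

```python
import numpy as np

S, N, E = 4, 2, 8
rng = np.random.default_rng(0)
k = rng.standard_normal((S, N, E))        # keys, sequence-first: (S, N, E)

# add_bias_kv: one learnable row of shape (1, 1, E), broadcast over the batch
# and appended along the sequence axis.
bias_k = rng.standard_normal((1, 1, E))
k_with_bias = np.concatenate([k, np.broadcast_to(bias_k, (1, N, E))], axis=0)

# add_zero_attn: an all-zero row appended the same way.
k_with_zero = np.concatenate([k, np.zeros((1, N, E))], axis=0)

print(k_with_bias.shape, k_with_zero.shape)   # source length grows from S to S + 1
```

Because the source length grows by one, any mask with a source axis would also need one extra column. The same operation applies to the value sequence.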

kdim: int or None = None

Feature dimension of the key input. When None (default), falls back to embed_dim and a fused in_proj_weight is used.

vdim: int or None = None

Feature dimension of the value input. When None (default), falls back to embed_dim.

num_kv_heads: int or None = None

Number of key/value heads for grouped-query attention (GQA) or multi-query attention (MQA). Must divide num_heads. None (default) is treated as num_heads, which gives standard multi-head attention; 1 gives MQA, where all query heads share one K/V head. With fewer K/V heads, each K/V head is shared across num_heads // num_kv_heads query heads. This shrinks the K/V projection and, during incremental decoding, the K/V cache; each K/V head is repeated to match the query heads just before attention. GQA forces the separate-projection layout (no fused in_proj_weight).
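The "repeat each K/V head to match the query heads" step can be sketched as follows. The helper `repeat_kv` and the (N, heads, S, head_dim) layout are assumptions for this example, as is the grouping in which consecutive query heads share a K/V head; Lucid's internal layout may differ.

```python
import numpy as np

def repeat_kv(x, n_rep):
    """x: (N, num_kv_heads, S, head_dim) -> (N, num_kv_heads * n_rep, S, head_dim)."""
    # np.repeat duplicates each head in place: [h0, h0, ..., h1, h1, ...].
    return np.repeat(x, n_rep, axis=1)

num_heads, num_kv_heads, head_dim = 8, 2, 16
n_rep = num_heads // num_kv_heads             # query heads per K/V head

rng = np.random.default_rng(0)
k = rng.standard_normal((1, num_kv_heads, 5, head_dim))   # what would be cached
k_rep = repeat_kv(k, n_rep)                                # what attention consumes

print(k.shape, k_rep.shape)
# Relative K/V cache size compared with standard multi-head attention.
cache_ratio = num_kv_heads / num_heads
print(cache_ratio)
```

With 8 query heads and 2 K/V heads, only a quarter of the keys and values need to be stored between decoding steps.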

batch_first: bool = False

Controls the expected layout of all input and output tensors.

False (default): (seq_len, batch, embed_dim) — the classic sequence-first convention.
True: (batch, seq_len, embed_dim) — more intuitive for most modern use-cases.
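The two layouts hold the same data with the first two axes swapped. A NumPy sketch of converting between them (illustrative only; it does not call Lucid):

```python
import numpy as np

rng = np.random.default_rng(0)
x_seq_first = rng.standard_normal((10, 2, 64))     # (seq_len, batch, embed_dim)

# Swap the sequence and batch axes to get (batch, seq_len, embed_dim).
x_batch_first = x_seq_first.transpose(1, 0, 2)

print(x_batch_first.shape)
# Token 3 of batch element 1 is the same vector in both layouts.
print(np.array_equal(x_batch_first[1, 3], x_seq_first[3, 1]))
```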

device: DeviceLike = None

Device on which to allocate parameters. None defaults to the current default device.

dtype: DTypeLike = None

Data type for all parameters. None defaults to the current default floating-point type.

Attributes

embed_dim: int

Total model dimension passed at construction.

num_heads: int

Number of attention (query) heads.

num_kv_heads: int

Number of key/value heads (== num_heads for standard MHA; fewer for GQA/MQA).

head_dim: int

Per-head dimension: embed_dim // num_heads.

kdim: int

Effective key feature dimension.

vdim: int

Effective value feature dimension.

dropout: float

Attention weight dropout probability.

batch_first: bool

Whether inputs are (batch, seq, feature).

in_proj_weight: Parameter or None

Fused (3 * embed_dim, embed_dim) projection weight used when kdim == vdim == embed_dim. Sliced at runtime into Q, K, V sub-weights. None when using separate weights.

q_proj_weight: Parameter or None

Separate query projection weight (embed_dim, embed_dim). Non-None only when the separate-projection layout is in use, that is, when kdim or vdim differs from embed_dim or when grouped-query attention is enabled (see num_kv_heads).

k_proj_weight: Parameter or None

Separate key projection weight (embed_dim, kdim).

v_proj_weight: Parameter or None

Separate value projection weight (embed_dim, vdim).

in_proj_bias: Parameter or None

Bias for the fused input projection (3 * embed_dim,). None when bias=False.

out_proj_weight: Parameter

Output projection weight (embed_dim, embed_dim).

out_proj_bias: Parameter or None

Output projection bias (embed_dim,). None when bias=False.

bias_k: Parameter or None

Learnable key bias row (1, 1, embed_dim). Non-None when add_bias_kv=True.

bias_v: Parameter or None

Learnable value bias row (1, 1, embed_dim). Non-None when add_bias_kv=True.

add_zero_attn: bool

Whether a zero row is appended to K and V.

Notes

The shapes below use the following notation:

$N$ — batch size
$L$ — target (query) sequence length
$S$ — source (key / value) sequence length
$E$ — embed_dim

When batch_first=False (default):

query: $(L, N, E)$
key: $(S, N, E_k)$ where $E_k$ = kdim
value: $(S, N, E_v)$ where $E_v$ = vdim
Output attn_output: $(L, N, E)$
Output attn_weights: $(N, L, S)$ when need_weights=True and average_attn_weights=True; $(N, h, L, S)$ when average_attn_weights=False.

When batch_first=True:

query: $(N, L, E)$
key / value: $(N, S, E_{k/v})$
Output attn_output: $(N, L, E)$
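The relationship between the per-head weights of shape $(N, h, L, S)$ and the averaged weights of shape $(N, L, S)$ can be illustrated with random scores. This is a NumPy sketch of the averaging step only, assuming a plain mean over the head axis.

```python
import numpy as np

N, h, L, S = 2, 8, 10, 10
rng = np.random.default_rng(0)
scores = rng.standard_normal((N, h, L, S))

# Softmax over the source axis gives one distribution per (batch, head, query).
exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
per_head = exp / exp.sum(axis=-1, keepdims=True)    # (N, h, L, S)

# Averaging over the head axis drops it: (N, L, S).
averaged = per_head.mean(axis=1)

print(per_head.shape, averaged.shape)
```

The mean of several probability distributions is still a probability distribution, so the rows of the averaged weights continue to sum to 1.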

Why scale by $1/\sqrt{d_k}$? Each entry of $QK^\top$ is a sum of $d_k$ products. If the components of the queries and keys are independent with zero mean and unit variance, that sum has variance $d_k$, so its typical magnitude grows like $\sqrt{d_k}$. Without the scale factor the softmax would saturate, producing near-one-hot distributions and vanishingly small gradients. Dividing by $\sqrt{d_k}$ restores roughly unit variance before the softmax.
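The variance argument is easy to check empirically. The sketch below draws independent standard-normal query and key vectors (an idealised assumption; real activations need not be independent or unit-variance) and compares the spread of raw and scaled dot products.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_std(d_k, n_samples=20000):
    """Standard deviation of q.k before and after dividing by sqrt(d_k)."""
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    raw = (q * k).sum(axis=-1)            # one dot product per sample
    return raw.std(), (raw / np.sqrt(d_k)).std()

for d_k in (16, 64, 256):
    raw_std, scaled_std = score_std(d_k)
    print(d_k, round(raw_std, 2), round(scaled_std, 2))
```

The raw standard deviation tracks $\sqrt{d_k}$ (about 4, 8 and 16 here), while the scaled value stays close to 1 regardless of $d_k$.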

Causal masking (is_causal=True): An upper-triangular $-\infty$ mask is added to the score matrix so that position $i$ cannot attend to any position $j > i$. This implements the autoregressive constraint needed for language model decoding.
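A minimal NumPy sketch of such a mask and its effect on the softmax follows. The helper name `causal_mask` is hypothetical, and the example covers the self-attention case where the query and key lengths are equal.

```python
import numpy as np

def causal_mask(length):
    """Additive mask: -inf where j > i (future positions), 0 elsewhere."""
    future = np.triu(np.ones((length, length), dtype=bool), k=1)
    return np.where(future, -np.inf, 0.0)

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)                 # exp(-inf) is exactly 0
    return exp / exp.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.standard_normal((4, 4))
weights = softmax(scores + causal_mask(4))

print(np.round(weights, 2))
```

The first row puts all of its weight on position 0, and every entry above the diagonal is exactly zero, so no position receives information from later ones.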

Fused vs. separate projections: When kdim == vdim == embed_dim, the Q/K/V projections share a single (3E, E) weight matrix. This layout allows a single linear call plus a cheap split_at on the result, which amortises kernel-launch overhead and improves cache locality on the MLX / Accelerate backends.

Checkpoint compatibility: State-dicts from the reference framework store the output projection under the key out_proj.weight / out_proj.bias (a sub-module named out_proj). Lucid's _load_from_state_dict hook transparently remaps those keys to the flat out_proj_weight / out_proj_bias attributes used here, so pre-trained weights can be loaded directly.
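The key renaming described above amounts to a small dictionary rewrite. The function below is a hypothetical stand-alone illustration of that idea; it is not Lucid's _load_from_state_dict hook, and the `prefix` argument is an assumption about how nested module names might be handled.

```python
def remap_out_proj_keys(state_dict, prefix=""):
    """Return a copy of state_dict with out_proj.* keys renamed to the flat form."""
    mapping = {
        prefix + "out_proj.weight": prefix + "out_proj_weight",
        prefix + "out_proj.bias": prefix + "out_proj_bias",
    }
    # Keys that are not in the mapping are passed through unchanged.
    return {mapping.get(key, key): value for key, value in state_dict.items()}

# Placeholder values stand in for real tensors.
checkpoint = {
    "attn.in_proj_weight": "W_in",
    "attn.out_proj.weight": "W_out",
    "attn.out_proj.bias": "b_out",
}
remapped = remap_out_proj_keys(checkpoint, prefix="attn.")
print(sorted(remapped))
```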

Examples

**Basic self-attention** (sequence-first layout):
>>> import lucid
>>> import lucid.nn as nn
>>> mha = nn.MultiheadAttention(embed_dim=64, num_heads=8)
>>> # Sequence-first: (seq_len, batch, embed_dim)
>>> x = lucid.randn(10, 2, 64)          # 10 tokens, batch=2
>>> out, weights = mha(x, x, x)
>>> out.shape
(10, 2, 64)
>>> weights.shape                        # averaged over heads
(2, 10, 10)
**Cross-attention with batch_first layout**:
>>> mha = nn.MultiheadAttention(embed_dim=64, num_heads=8,
...                             batch_first=True)
>>> q = lucid.randn(2, 6, 64)           # batch=2, 6 query tokens
>>> kv = lucid.randn(2, 10, 64)         # 10 key/value tokens
>>> out, _ = mha(q, kv, kv, need_weights=False)
>>> out.shape
(2, 6, 64)
**Cross-modal attention with different key/value dimensions**:
>>> mha = nn.MultiheadAttention(embed_dim=128, num_heads=4,
...                             kdim=64, vdim=64)
>>> q = lucid.randn(5, 1, 128)
>>> kv = lucid.randn(7, 1, 64)
>>> out, weights = mha(q, kv, kv)
>>> out.shape
(5, 1, 128)


Constructors

__init__(embed_dim: int, num_heads: int, dropout: float = 0.0, bias: bool = True, add_bias_kv: bool = False, add_zero_attn: bool = False, kdim: int | None = None, vdim: int | None = None, num_kv_heads: int | None = None, batch_first: bool = False, device: DeviceLike = None, dtype: DTypeLike = None) → None

Initialise the MultiheadAttention module. See the class docstring for parameter semantics.

Instance methods

extra_repr() → str

Return a string representation of the layer's configuration.

forward(query: Tensor, key: Tensor, value: Tensor, key_padding_mask: Tensor | None = None, need_weights: bool = True, attn_mask: Tensor | None = None, average_attn_weights: bool = True, is_causal: bool = False, past_key_value: Cache | None = None, layer_idx: int = 0, cache_position: Tensor | None = None, use_cache: bool = False, is_cross_attention: bool = False) → Tensor

Run the forward pass of the module.

Parameters

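query: Tensor

Query input. See the Notes in the class docstring for the expected shape.

key: Tensor

Key input. See the Notes in the class docstring for the expected shape.

value: Tensor

Value input. See the Notes in the class docstring for the expected shape.

key_padding_mask: Tensor or None = None

See the class docstring.

need_weights: bool = True

If True, the attention weights are returned along with the output; see the Notes in the class docstring for their shape.

attn_mask: Tensor or None = None

See the class docstring.

average_attn_weights: bool = True

If True, the returned attention weights are averaged over heads, giving shape $(N, L, S)$; if False, per-head weights of shape $(N, h, L, S)$ are returned.

is_causal: bool = False

If True, applies the causal mask described in the Notes of the class docstring.

past_key_value: Cache or None, optional, keyword-only = None

Key/value cache for incremental decoding. None disables caching.

layer_idx: int, optional, keyword-only = 0

Index of this attention layer within the cache.

cache_position: Tensor or None, optional, keyword-only = None

Absolute positions of the query tokens; accepted for API parity.

use_cache: bool, optional, keyword-only = False

Read/write past_key_value when True.

is_cross_attention: bool, optional, keyword-only = False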

True for decoder cross-attention (keys/values from the encoder memory, cached once); False for self-attention (cache grows).
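The difference between the two caching modes can be sketched with a toy cache. `ToyKVCache` below is a hypothetical class written for this example, not Lucid's Cache type, and the (N, heads, S, head_dim) layout is assumed.

```python
import numpy as np

class ToyKVCache:
    """Hypothetical per-layer key/value store for illustration only."""

    def __init__(self):
        self._store = {}

    def update(self, layer_idx, k, v, is_cross_attention=False):
        if layer_idx not in self._store:
            # First call for this layer: store what we were given.
            self._store[layer_idx] = (k, v)
        elif not is_cross_attention:
            # Self-attention: append the new tokens along the sequence axis.
            old_k, old_v = self._store[layer_idx]
            self._store[layer_idx] = (
                np.concatenate([old_k, k], axis=2),
                np.concatenate([old_v, v], axis=2),
            )
        # Cross-attention: the encoder memory is fixed, so reuse it unchanged.
        return self._store[layer_idx]

rng = np.random.default_rng(0)
self_cache, cross_cache = ToyKVCache(), ToyKVCache()
memory = rng.standard_normal((1, 4, 7, 16))      # encoder keys/values, S=7

for step in range(3):                            # three decoding steps
    new = rng.standard_normal((1, 4, 1, 16))     # one new token per step
    k_self, _ = self_cache.update(0, new, new)
    k_cross, _ = cross_cache.update(0, memory, memory, is_cross_attention=True)

print(k_self.shape, k_cross.shape)
```

After three steps the self-attention cache holds three positions, while the cross-attention cache still holds the seven encoder positions stored on the first step.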

Returns

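Tensor

Output of the attention layer. The class docstring examples unpack it as the pair (attn_output, attn_weights); see the Notes there for the shape of each in both layouts.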
