scaled_dot_product_attention
scaled_dot_product_attention(query: Tensor, key: Tensor, value: Tensor, attn_mask: Tensor | None = None, dropout_p: float = 0.0, is_causal: bool = False, scale: float | None = None) → Tensor

Scaled dot-product attention, the core of every Transformer block.
Computes the attention-weighted aggregation of value vectors using query-key dot products as similarity scores:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}} + M\right) V$$
where $M$ is an optional additive mask (-inf to disallow attention at a position, 0 to allow) and $d_k$ is the head dimension (E in the parameter shapes). The $1/\sqrt{d_k}$ factor keeps the softmax in a usable temperature regime as the head dimension $d_k$ grows: without it, large dot products saturate the softmax into a near one-hot distribution and gradients vanish.
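The saturation effect described above can be observed with a small, dependency-free sketch. It is not part of the library: the helper names `softmax` and `max_attention_weight` are hypothetical, and the inputs are random unit-variance vectors chosen only for illustration. The sketch compares the largest attention weight with and without the $1/\sqrt{d_k}$ factor for one query against a handful of keys.

```python
import math
import random


def softmax(scores):
    """Softmax over a list of floats, shifted by the max for stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def max_attention_weight(d_k, num_keys=8, scaled=True, seed=0):
    """Largest softmax weight for one random query against num_keys random keys."""
    rng = random.Random(seed)  # same seed -> same vectors for both settings
    q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
    keys = [[rng.gauss(0.0, 1.0) for _ in range(d_k)] for _ in range(num_keys)]
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    if scaled:
        scores = [s / math.sqrt(d_k) for s in scores]
    return max(softmax(scores))


for d_k in (4, 64, 1024):
    raw_max = max_attention_weight(d_k, scaled=False)
    scaled_max = max_attention_weight(d_k, scaled=True)
    print(f"d_k={d_k:5d}  unscaled max weight={raw_max:.4f}  scaled max weight={scaled_max:.4f}")
```

Because dividing the scores by a factor greater than 1 can only flatten the softmax, the unscaled maximum weight is never smaller than the scaled one; as `d_k` grows, the unscaled weight tends toward 1 (a near one-hot distribution).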
Parameters
query : Tensor
    (B, H, T, E). B batch, H heads, T query positions, E head dimension.
key : Tensor
    (B, H, S, E). S key positions.
value : Tensor
    (B, H, S, V). V may differ from E.
attn_mask : Tensor, default None
    (B, H, T, S). Use large negative values (or -inf) at positions to mask out. Mutually exclusive with is_causal.
dropout_p : float, default 0.0
    Dropout probability.
is_causal : bool, default False
    If True, mask out the strictly upper-triangular part of the score matrix so each query position only attends to keys at the same or earlier positions (autoregressive decoder self-attention).
scale : float, default None
    Scale applied to the scores before the softmax.
Returns
Tensor
    Attention output of shape (B, H, T, V).
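To make the shapes and the mask semantics concrete, here is a minimal single-head reference sketch on nested Python lists. It is an illustration, not the library's implementation: `sdpa_reference` is a hypothetical name, batch and head dimensions are dropped, dropout is omitted, and the default scale of 1/sqrt(E) is an assumption based on the formula in the description. It also assumes every query row has at least one unmasked key.

```python
import math


def sdpa_reference(query, key, value, attn_mask=None, is_causal=False, scale=None):
    """Single-head attention on nested lists.

    query: T x E, key: S x E, value: S x V, attn_mask: T x S additive mask.
    Returns a T x V nested list.
    """
    if attn_mask is not None and is_causal:
        raise ValueError("attn_mask and is_causal are mutually exclusive")
    T, S, E, V = len(query), len(key), len(query[0]), len(value[0])
    if scale is None:
        scale = 1.0 / math.sqrt(E)  # assumed default, not confirmed by the docs
    neg_inf = float("-inf")
    output = []
    for t in range(T):
        scores = []
        for s in range(S):
            score = scale * sum(q * k for q, k in zip(query[t], key[s]))
            if attn_mask is not None:
                score += attn_mask[t][s]  # additive mask: 0 allows, -inf disallows
            if is_causal and s > t:
                score = neg_inf  # key position lies in the future of query t
            scores.append(score)
        m = max(scores)  # shift by the row max so exp() cannot overflow
        weights = [math.exp(x - m) for x in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        output.append(
            [sum(weights[s] * value[s][j] for s in range(S)) for j in range(V)]
        )
    return output


q = [[1.0, 0.0], [0.0, 1.0]]            # T=2, E=2
k = [[1.0, 0.0], [0.0, 1.0]]            # S=2, E=2
v = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # S=2, V=3
out = sdpa_reference(q, k, v, is_causal=True)
print(len(out), len(out[0]))  # T rows, V columns
print(out[0])                 # first query sees only the first key
```

With `is_causal=True` the first query position can attend only to the first key, so its output row equals the first value row. Passing the equivalent additive mask `[[0.0, -inf], [0.0, 0.0]]` through `attn_mask` gives the same result in this sketch.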
Notes
Introduced in Attention Is All You Need (Vaswani et al., 2017). The implementation uses the log-sum-exp form of softmax for numerical stability under aggressive masking, and fuses the scale into the score matrix prior to softmax. Causal masking enables efficient autoregressive decoding when combined with a key/value cache.
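The log-sum-exp form mentioned above can be sketched as follows. This shows the general technique rather than the library's actual code; `logsumexp` and `softmax_lse` are hypothetical helper names, and the sketch assumes each row contains at least one finite score.

```python
import math


def logsumexp(scores):
    """log(sum(exp(s))) computed without overflow by factoring out the max."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))


def softmax_lse(scores):
    """Softmax written as exp(s - logsumexp(scores))."""
    lse = logsumexp(scores)
    return [math.exp(s - lse) for s in scores]


# Large scores plus a masked (-inf) position: a naive exp(1000.0) would overflow.
scores = [1000.0, 999.0, float("-inf")]
weights = softmax_lse(scores)
print(weights)
```

The masked position receives a weight of exactly zero because `math.exp(float("-inf"))` is `0.0`, while the remaining weights still sum to 1 even though the raw scores are far outside the range where `math.exp` is finite.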
Examples
>>> import lucid
>>> from lucid.nn.functional import scaled_dot_product_attention
>>> q = lucid.randn(2, 8, 16, 64) # (B, H, T, E)
>>> k = lucid.randn(2, 8, 16, 64)
>>> v = lucid.randn(2, 8, 16, 64)
>>> out = scaled_dot_product_attention(q, k, v, is_causal=True)
>>> out.shape
(2, 8, 16, 64)