softmax

softmax(x: Tensor, dim: int | None = None) → Tensor



Apply the softmax function along a dimension.

Converts a vector of real-valued logits into a probability distribution: the outputs are non-negative and sum to one along dim. It is the standard building block for multi-class classification heads and attention weights.

Parameters

x : Tensor

Input tensor of any shape (the "logits").

dim : int | None = None

Dimension along which softmax is computed. The signature default is None, in which case the dimension falls back to -1 (the last axis).
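To make the role of dim concrete, here is a minimal sketch on a nested list standing in for a 2-D tensor. It does not call lucid: `softmax_2d` and `_softmax_1d` are hypothetical helpers, and negative dim values are assumed to count from the last axis, as the -1 default suggests.

```python
import math

def _softmax_1d(xs):
    # Softmax of one vector, shifted by its maximum for numerical safety.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_2d(rows, dim=-1):
    """Softmax over a list of equal-length rows along dim (0, 1, -1 or -2)."""
    if dim < 0:
        dim += 2  # assumed convention: -1 is the last axis
    if dim == 1:
        # Normalise each row independently.
        return [_softmax_1d(r) for r in rows]
    if dim == 0:
        # Normalise each column independently, then transpose back.
        cols = [_softmax_1d(list(c)) for c in zip(*rows)]
        return [list(r) for r in zip(*cols)]
    raise ValueError("dim must be in [-2, 1] for a 2-D input")

x = [[1.0, 2.0, 3.0],
     [1.0, 1.0, 1.0]]

by_row = softmax_2d(x, dim=1)  # each row sums to 1
by_col = softmax_2d(x, dim=0)  # each column sums to 1

print([sum(r) for r in by_row])
print([sum(c) for c in zip(*by_col)])
```

The output shape is the same in both cases; only the direction in which the entries are normalised changes.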

Returns

Tensor

Tensor of the same shape as x. Each slice along dim lies on the probability simplex: its entries are non-negative and sum to 1.

Notes

Mathematical definition (per-vector along dim):

$$\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

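A naïve implementation overflows for large positive x. The engine applies the standard max-subtraction shift $x_i \mapsto x_i - \max_j x_j$ (the same trick used in log-sum-exp). The shift leaves the result unchanged, because the common factor $e^{-\max_j x_j}$ cancels between numerator and denominator, while keeping every exponent non-positive.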
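The effect of the shift can be sketched in plain Python, independent of lucid. The helper names `naive_softmax` and `stable_softmax` are hypothetical and not part of the library; this illustrates the technique and is not the engine's kernel.

```python
import math

def naive_softmax(xs):
    """Direct evaluation of exp(x_i) / sum_j exp(x_j); overflows for large x."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(xs):
    """Same function, evaluated after subtracting max(xs) from every entry."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # every exponent is <= 0
    total = sum(exps)                     # >= 1, since the max entry gives exp(0)
    return [e / total for e in exps]

small = [1.0, 2.0, 3.0]
large = [1000.0, 1001.0, 1002.0]  # same gaps between entries, shifted by 999

print(stable_softmax(small))
print(stable_softmax(large))  # matches the result for `small`

try:
    naive_softmax(large)
except OverflowError as err:
    # math.exp(1000.0) exceeds the range of a double
    print("naive version failed:", err)
```

Because softmax depends only on the differences between entries, `small` and `large` yield the same probabilities once the maximum is subtracted.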

For loss computation, prefer log_softmax followed by nll_loss (or cross_entropy end-to-end), since composing log(softmax(...)) loses precision when a probability underflows toward zero. The Jacobian has the closed form $\partial p_i / \partial x_j = p_i (\delta_{ij} - p_j)$. Combined with cross-entropy against a one-hot target $y$, it collapses the gradient with respect to the logits to $p - y$.
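The closed form can be checked numerically. The sketch below is plain Python and does not use lucid; `softmax_jacobian`, `cross_entropy` and `numeric_grad` are hypothetical helpers written for this illustration. It builds the Jacobian $p_i (\delta_{ij} - p_j)$ and compares the resulting softmax cross-entropy gradient, $p - y$ for a one-hot target $y$, against central finite differences.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(xs):
    """J[i][j] = p_i * (delta_ij - p_j)."""
    p = softmax(xs)
    n = len(p)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(n)]
            for i in range(n)]

def cross_entropy(xs, target):
    """-log softmax(xs)[target], computed in log-sum-exp form."""
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return lse - xs[target]

def numeric_grad(f, xs, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at xs."""
    grads = []
    for k in range(len(xs)):
        up = list(xs)
        up[k] += eps
        down = list(xs)
        down[k] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads

logits = [1.0, 2.0, 3.0]
target = 2  # index of the true class

p = softmax(logits)
jac = softmax_jacobian(logits)

# Chain rule: dL/dx_j = sum_i (dL/dp_i) * J[i][j], where L = -log p_target,
# so only i == target contributes, with dL/dp_target = -1 / p_target.
via_jacobian = [-jac[target][j] / p[target] for j in range(len(p))]

# The same gradient written directly as p - y.
analytic = [p[k] - (1.0 if k == target else 0.0) for k in range(len(p))]

numeric = numeric_grad(lambda xs: cross_entropy(xs, target), logits)

print(via_jacobian)
print(analytic)
print(numeric)
```

All three printed vectors should agree to within finite-difference error, which is why fused softmax cross-entropy implementations can skip materialising the full Jacobian.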

Examples

>>> import lucid
>>> from lucid.nn.functional import softmax
>>> logits = lucid.tensor([[1.0, 2.0, 3.0]])
>>> p = softmax(logits, dim=1)
>>> p
Tensor([[0.0900, 0.2447, 0.6652]])
>>> p.sum(dim=1)
Tensor([1.0000])
