fn

kl_div

Tensor
kl_div(x: Tensor, target: Tensor, size_average: bool | None = None, reduction: str = 'mean', log_target: bool = False)

Kullback-Leibler divergence between two distributions.

Measures the information lost when distribution $q$ (the input) is used to approximate distribution $p$ (the target). Used heavily in knowledge distillation (matching a student's output distribution to a teacher's), variational inference (the ELBO's KL term), and policy-gradient regularisation in RL.

The convention here matches the reference framework: x is $\log q$ (the log-probabilities of the prediction), and target is $p$ (the target probabilities), or $\log p$ if log_target=True. KL is asymmetric in its arguments and is not a metric, so swapping x and target changes the result.
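The asymmetry can be checked numerically. The following is a minimal pure-Python sketch, not lucid's implementation; the helper name kl and the two example distributions are assumptions made for illustration.

```python
import math

def kl(p, q):
    """Discrete D_KL(p || q) for two probability lists (hypothetical helper)."""
    # Entries with p_i == 0 contribute nothing (0 * log 0 is taken as 0).
    return sum(pi * (math.log(pi) - math.log(qi))
               for pi, qi in zip(p, q) if pi > 0)

p = [0.8, 0.15, 0.05]   # plays the role of `target`
q = [0.6, 0.3, 0.1]     # plays the role of exp(x)

forward = kl(p, q)      # D_KL(p || q)
reverse = kl(q, p)      # D_KL(q || p), arguments swapped
print(forward, reverse)
```

The two printed values differ, while kl(p, p) is zero, which is why the order of x and target matters.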

Parameters

x: Tensor
Log-probabilities of the predicted distribution $\log q$, any shape.
target: Tensor
Probabilities of the target distribution $p$, or its log when log_target=True. Same shape as x.
size_average: bool or None = None
Deprecated. Retained for signature compatibility and ignored; use reduction instead.
reduction: str = 'mean'
"none", "mean", "sum", or "batchmean" (default "mean"). "batchmean" divides the summed loss by the leading (batch) dimension and is the only reduction that yields the mathematically correct KL value in expectation.
log_target: bool = False
When True, treat target as already-logged ($\log p$). This often avoids a redundant $\log$ / $\exp$ round-trip.
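To illustrate what log_target changes, here is a small pure-Python reference that follows the semantics described above. It is a sketch under assumptions: kl_pointwise is a hypothetical helper, not part of lucid, and it works on flat lists rather than tensors.

```python
import math

def kl_pointwise(log_q, target, log_target=False):
    """Per-element KL terms p * (log p - log q); illustrative only."""
    if log_target:
        # target already holds log p, so p is recovered with exp.
        return [math.exp(lp) * (lp - lq) for lp, lq in zip(target, log_q)]
    # target holds p, so log p must be computed here; p == 0 contributes 0.
    return [p * (math.log(p) - lq) if p > 0 else 0.0
            for p, lq in zip(target, log_q)]

log_q = [math.log(v) for v in (0.6, 0.3, 0.1)]
p = [0.8, 0.15, 0.05]
log_p = [math.log(v) for v in p]

from_probs = kl_pointwise(log_q, p)                      # target is p
from_logs = kl_pointwise(log_q, log_p, log_target=True)  # target is log p
print(sum(from_probs), sum(from_logs))
```

Both calls give the same terms up to floating-point error. When the target already comes out of a log-softmax, passing it directly with log_target=True skips converting it back to probabilities first.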

Returns

Tensor

A scalar for "mean", "sum", and "batchmean"; a tensor with the same shape as x for "none".

Notes

Per-element loss:

$$L_i = p_i \cdot (\log p_i - \log q_i)$$

Globally:

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_i p_i \log \tfrac{p_i}{q_i} \ge 0,$$

with equality iff $p = q$ almost everywhere. The standard "mean" reduction divides by the element count, not the batch size, so it under-reports the divergence value; prefer "batchmean" whenever the absolute scale matters.
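The difference between the reductions can be made concrete with a pure-Python sketch on a batch of two rows with three classes each. The helpers kl_terms and reduce_terms are hypothetical and only mirror the behaviour described in these notes; they are not lucid's implementation.

```python
import math

def kl_terms(log_q, p):
    """Per-element terms p * (log p - log q) for a batch of rows."""
    return [[pi * (math.log(pi) - lqi) if pi > 0 else 0.0
             for pi, lqi in zip(p_row, lq_row)]
            for p_row, lq_row in zip(p, log_q)]

def reduce_terms(terms, reduction):
    """Apply one of the four documented reductions to nested-list terms."""
    total = sum(sum(row) for row in terms)
    if reduction == "none":
        return terms                                   # full shape, no reduction
    if reduction == "sum":
        return total
    if reduction == "mean":
        return total / sum(len(row) for row in terms)  # divide by element count
    if reduction == "batchmean":
        return total / len(terms)                      # divide by batch size
    raise ValueError(f"unknown reduction: {reduction!r}")

p = [[0.8, 0.15, 0.05], [0.2, 0.5, 0.3]]
q = [[0.6, 0.3, 0.1], [0.25, 0.25, 0.5]]
log_q = [[math.log(v) for v in row] for row in q]

terms = kl_terms(log_q, p)
for name in ("sum", "mean", "batchmean"):
    print(name, reduce_terms(terms, name))
```

With three classes per row, "mean" comes out exactly three times smaller than "batchmean", which is the under-reporting described above.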

Examples

>>> import lucid
>>> from lucid.nn.functional import kl_div, log_softmax
>>> log_q = log_softmax(lucid.tensor([[2.0, 0.5, 0.1]]), dim=1)
>>> p = lucid.tensor([[0.8, 0.15, 0.05]])
>>> kl_div(log_q, p, reduction="batchmean")
Tensor(0.0239...)