kl_div

kl_div(x: Tensor, target: Tensor, size_average: bool | None = None, reduction: ReductionKL = 'mean', log_target: bool = False) → Tensor

Kullback-Leibler divergence between two distributions.

Measures the information lost when distribution $q$ (the input) is used to approximate distribution $p$ (the target). Used heavily in knowledge distillation (matching a student's output distribution to a teacher's), variational inference (the ELBO's KL term), and policy-gradient regularisation in RL.

The convention here matches the reference framework: x is $\log q$ (log-predicted), and target is $p$ (target probability) — or $\log p$ if log_target=True. Note this is asymmetric in its arguments — KL is not a metric.
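The argument convention and the asymmetry can be checked with a small pure-Python calculation. This is a reference sketch of the formula only, not this library's implementation: `kl_from_log_q` is a hypothetical helper name, and the sketch assumes strictly positive probabilities.

```python
import math

def kl_from_log_q(log_q, p):
    """Sum of p_i * (log p_i - log q_i); first argument is log q, second is p."""
    return sum(pi * (math.log(pi) - lqi) for lqi, pi in zip(log_q, p))

p = [0.8, 0.15, 0.05]   # target distribution
q = [0.5, 0.3, 0.2]     # predicted distribution

# The input is passed as log-probabilities, the target as plain probabilities.
kl_pq = kl_from_log_q([math.log(v) for v in q], p)

# Swapping the roles of the two distributions gives a different value.
kl_qp = kl_from_log_q([math.log(v) for v in p], q)

print(kl_pq, kl_qp)
```

Here `kl_pq` comes out near 0.203 and `kl_qp` near 0.250, so the order of the arguments matters.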

Parameters

x : Tensor

Log-probabilities of the predicted distribution $\log q$, any shape.

target : Tensor

Probabilities of the target distribution $p$, or its log when log_target=True. Same shape as x.

size_average : bool or None = None

Deprecated. Retained for signature compatibility; ignored — use reduction instead.

reduction : str = 'mean'

One of "none", "mean", "sum", or "batchmean". "batchmean" divides the summed loss by the size of the leading (batch) dimension and is the only reduction that matches the mathematical definition of KL divergence averaged over the batch.

log_target : bool = False

When True, treat target as already-logged ($\log p$). This often avoids a redundant $\log$/$\exp$ round-trip.
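The two target conventions can be compared on a single element with the sketch below. It assumes the per-element formula $p \cdot (\log p - \log q)$ and uses a hypothetical helper, `pointwise_kl`, written with the standard library only; it is not the library's code.

```python
import math

def pointwise_kl(log_q, target, log_target=False):
    """Per-element loss for one pair of values (illustrative helper)."""
    if log_target:
        # target already holds log p, so a single exp recovers p
        return math.exp(target) * (target - log_q)
    # target holds p, so its log has to be taken here
    return target * (math.log(target) - log_q)

log_q = math.log(0.3)
p = 0.15

a = pointwise_kl(log_q, p)                             # target given as p
b = pointwise_kl(log_q, math.log(p), log_target=True)  # target given as log p

print(a, b)
```

Both calls give the same value up to floating-point rounding. The value is negative here because $p < q$ for this element; only the sum over a whole distribution is guaranteed to be non-negative.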

Returns

Tensor

A scalar for "mean", "sum", and "batchmean"; a tensor with the same shape as x for "none".

Notes

Per-element loss:

$$L_i = p_i \cdot (\log p_i - \log q_i)$$

Globally:

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_i p_i \log \tfrac{p_i}{q_i} \ge 0,$$

with equality iff $p = q$ almost everywhere. The "mean" reduction divides by the total number of elements rather than the batch size, so for an input of shape $(N, C)$ it returns a value $C$ times smaller than the per-sample divergence. Prefer "batchmean" whenever the absolute scale matters.
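The effect of each reduction can be seen in a reference calculation over a small batch. The sketch below uses nested Python lists instead of tensors, assumes strictly positive probabilities, and introduces a hypothetical helper `kl_reduce`; it illustrates the arithmetic rather than the library's implementation.

```python
import math

def kl_reduce(log_q, p, reduction="mean"):
    """Reference reductions over a 2-D batch (rows are batch items)."""
    losses = [
        [pi * (math.log(pi) - lqi) for lqi, pi in zip(lq_row, p_row)]
        for lq_row, p_row in zip(log_q, p)
    ]
    if reduction == "none":
        return losses
    total = sum(sum(row) for row in losses)
    if reduction == "sum":
        return total
    if reduction == "batchmean":
        return total / len(losses)                     # divide by batch size N
    if reduction == "mean":
        return total / sum(len(row) for row in losses)  # divide by N * C
    raise ValueError(f"unknown reduction: {reduction!r}")

q = [[0.5, 0.3, 0.2], [0.2, 0.2, 0.6]]
p = [[0.8, 0.15, 0.05], [0.1, 0.3, 0.6]]
log_q = [[math.log(v) for v in row] for row in q]

batchmean = kl_reduce(log_q, p, "batchmean")
mean = kl_reduce(log_q, p, "mean")
print(batchmean, mean)
```

With three classes per row, `mean` is one third of `batchmean`, which is the scale difference described above.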

Examples

>>> import lucid
>>> from lucid.nn.functional import kl_div, log_softmax
>>> log_q = log_softmax(lucid.tensor([[2.0, 0.5, 0.1]]), dim=1)
>>> p = lucid.tensor([[0.8, 0.15, 0.05]])
>>> kl_div(log_q, p, reduction="batchmean")
Tensor(0.0239...)
