kl_div
kl_div(x: Tensor, target: Tensor, size_average: bool | None = None, reduction: str = 'mean', log_target: bool = False) → Tensor
Kullback-Leibler divergence between two distributions.
Measures the "information gain" from approximating distribution $P$ (the target) with distribution $Q$ (the input). Used heavily in knowledge distillation (matching student logits to a teacher's), variational inference (the ELBO's KL term), and policy-gradient regularisation in RL.
The convention here matches the reference framework: x is $\log Q$ (the log of the predicted probabilities), and target is $P$ (the target probabilities), or $\log P$ if log_target=True.
Note that KL divergence is asymmetric in its arguments: swapping x and target generally changes the result, so it is not a metric.
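The asymmetry can be checked by hand. The following is a minimal, dependency-free sketch rather than the library's implementation; the helpers `softmax` and `kl` are hypothetical names introduced only for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a flat list of floats.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    # D_KL(P || Q) = sum_i p_i * (log p_i - log q_i).
    # Terms with p_i == 0 contribute 0 by convention.
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)

p = [0.8, 0.15, 0.05]          # target distribution P
q = softmax([2.0, 0.5, 0.1])   # predicted distribution Q

forward = kl(p, q)   # the direction measured when x = log Q and target = P
reverse = kl(q, p)   # the two distributions swapped
```

Here `forward` is roughly 0.0239 and `reverse` roughly 0.0297, so the two directions disagree even for distributions that are fairly close.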
Parameters
x : Tensor
    Log-probabilities of the predicted distribution ($\log Q$).
target : Tensor
    Target probabilities ($P$), or log-probabilities ($\log P$) if log_target=True. Same shape as x.
size_average : bool or None = None
    Use reduction instead.
reduction : str = 'mean'
    "none", "mean", "sum", or "batchmean" (default "mean"). "batchmean" divides the summed loss by the size of the leading (batch) dimension and is the only reduction that yields the mathematically correct KL value in expectation.
log_target : bool = False
    If True, treat target as already-logged ($\log P$). This often avoids a redundant exp/log round-trip.
Returns
Tensor
    A scalar, or a tensor with the same shape as x when reduction="none".
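To illustrate the log_target parameter, here is a small sketch of the per-element computation under both settings. It assumes the per-element loss is target probability times (log target minus input); `kl_elementwise` is a hypothetical helper, not part of the library.

```python
import math

def kl_elementwise(x, target, log_target=False):
    # x holds log Q; target holds P, or log P when log_target is True.
    if log_target:
        # exp(log P) * (log P - log Q): no log of the target is needed.
        return [math.exp(t) * (t - xi) for xi, t in zip(x, target)]
    # P * (log P - log Q); zero-probability targets contribute 0.
    return [t * (math.log(t) - xi) if t > 0 else 0.0 for xi, t in zip(x, target)]

log_q = [math.log(v) for v in [0.7, 0.2, 0.1]]
p = [0.8, 0.15, 0.05]
log_p = [math.log(v) for v in p]

plain = kl_elementwise(log_q, p)                        # target given as probabilities
logged = kl_elementwise(log_q, log_p, log_target=True)  # target given as log-probabilities
```

Both calls produce the same per-element values up to floating-point rounding, so log_target=True is mainly useful when the target is already available in log space, for example as the output of a log-softmax.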
Notes
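Per-element loss, where $p_i$ is the target probability and $x_i = \log q_i$ is the input:
$$\ell_i = p_i \,(\log p_i - x_i)$$
Globally:
$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_i p_i \log \frac{p_i}{q_i} \;\ge\; 0$$
with equality iff $P = Q$ almost everywhere. The standard
"mean" reduction divides by the element count, not the
batch size, so it under-reports the divergence by a factor equal to the number of elements per sample. Prefer
"batchmean" whenever the absolute scale matters.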
Examples
>>> import lucid
>>> from lucid.nn.functional import kl_div, log_softmax
>>> log_q = log_softmax(lucid.tensor([[2.0, 0.5, 0.1]]), dim=1)
>>> p = lucid.tensor([[0.8, 0.15, 0.05]])
>>> kl_div(log_q, p, reduction="batchmean")
Tensor(0.0239...)
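The effect of each reduction described in the Notes can also be checked without the library. The sketch below is a plain-Python reference under the assumption that inputs are shaped (batch, classes); `kl_div_reference` is a hypothetical stand-in and not the actual implementation.

```python
import math

def kl_div_reference(x, target, reduction="mean"):
    # x: rows of log Q, target: rows of P, both shaped (batch, classes).
    per_elem = [
        [t * (math.log(t) - xi) if t > 0 else 0.0 for xi, t in zip(x_row, t_row)]
        for x_row, t_row in zip(x, target)
    ]
    if reduction == "none":
        return per_elem                      # full-shape result
    total = sum(sum(row) for row in per_elem)
    if reduction == "sum":
        return total
    if reduction == "batchmean":
        return total / len(x)                # divide by batch size
    if reduction == "mean":
        return total / sum(len(row) for row in per_elem)  # divide by element count
    raise ValueError(f"unknown reduction: {reduction!r}")

q = [[0.7, 0.2, 0.1], [0.25, 0.25, 0.5]]
p = [[0.8, 0.15, 0.05], [0.1, 0.3, 0.6]]
log_q = [[math.log(v) for v in row] for row in q]

batchmean = kl_div_reference(log_q, p, "batchmean")
mean = kl_div_reference(log_q, p, "mean")
```

With three classes per row, `mean` comes out as exactly one third of `batchmean`, which is the under-reporting the Notes warn about.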