class

KLDivLoss

extends Module

KLDivLoss(reduction: ReductionKL = 'mean', log_target: bool = False)

Kullback–Leibler divergence loss.

Measures how one probability distribution diverges from a reference distribution. The input x must be log-probabilities (e.g. output of LogSoftmax), and the target y must be probabilities (or log-probabilities when log_target=True).

With log_target=False (default):

$$\ell(x, y) = y \cdot (\log y - x)$$

With log_target=True (target already in log-space):

$$\ell(x, y) = e^{y} \cdot (y - x)$$
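The two pointwise forms above describe the same quantity. A minimal sketch in plain Python (no lucid calls; the helper `kl_pointwise` and the sample distributions are hypothetical, and treating a zero target as contributing 0 is an assumption, not confirmed library behaviour):

```python
import math

def kl_pointwise(x, y, log_target=False):
    """Elementwise KL term; x is a log-probability, y is the target."""
    if log_target:
        # y is already log(target): exp(y) * (y - x)
        return math.exp(y) * (y - x)
    # y is a probability: y * (log y - x).
    # Assumption: a zero target contributes 0 (the 0 * log 0 = 0 convention).
    return y * (math.log(y) - x) if y > 0 else 0.0

pred = [0.2, 0.5, 0.3]    # predicted distribution
target = [0.1, 0.6, 0.3]  # reference distribution

log_pred = [math.log(q) for q in pred]
log_tgt = [math.log(p) for p in target]

# Same prediction, target given once as probabilities and once in log-space.
terms_prob = [kl_pointwise(x, y) for x, y in zip(log_pred, target)]
terms_log = [kl_pointwise(x, y, log_target=True)
             for x, y in zip(log_pred, log_tgt)]
```

Both lists agree up to floating-point error, so log_target only changes how the target is supplied, not what is measured.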

The scalar loss is obtained by reducing $\ell$ according to reduction. Note that 'batchmean' (if supported) sums the elementwise terms and divides by the batch size $N$, which corresponds to the mathematical definition of KL divergence.
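The reduction modes can be sketched in plain Python as follows. This is an illustration only: `kl_terms` and `reduce_loss` are hypothetical helpers, and the assumption that 'mean' averages over every element (rather than over samples) follows the usual convention for this loss and is not confirmed by this page.

```python
import math

def kl_terms(log_pred_row, target_row):
    """Elementwise y * (log y - x) for one sample (log_target=False form)."""
    return [y * (math.log(y) - x) for x, y in zip(log_pred_row, target_row)]

def reduce_loss(terms, reduction="mean"):
    """Reduce a batch of per-element terms given as a list of rows."""
    flat = [t for row in terms for t in row]
    if reduction == "none":
        return terms                   # unreduced, same shape as the input
    if reduction == "sum":
        return sum(flat)
    if reduction == "mean":
        return sum(flat) / len(flat)   # assumption: divide by element count
    if reduction == "batchmean":
        return sum(flat) / len(terms)  # divide by the batch size N
    raise ValueError(f"unknown reduction: {reduction!r}")

preds = [[0.2, 0.5, 0.3], [0.6, 0.3, 0.1]]        # N = 2 samples, 3 classes
targets = [[0.1, 0.6, 0.3], [0.5, 0.25, 0.25]]
log_preds = [[math.log(q) for q in row] for row in preds]
terms = [kl_terms(x, y) for x, y in zip(log_preds, targets)]

total = reduce_loss(terms, "sum")
per_elem = reduce_loss(terms, "mean")
per_sample = reduce_loss(terms, "batchmean")
```

Under these assumptions, per_sample is the average KL divergence per sample, while per_elem is smaller by a factor equal to the number of classes.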

Parameters

reduction: str = 'mean'

One of 'none' | 'mean' (default) | 'sum' | 'batchmean'.

log_target: bool = False

If True, target is interpreted as log-probabilities. Default False.

Attributes

reduction: str

The reduction mode.

log_target: bool

Whether the target is in log-space.

Notes

Input x : $(*)$ — log-probabilities.
Target y : $(*)$ — probabilities (or log-probabilities when log_target=True).
Output : scalar for 'mean' / 'sum' / 'batchmean'; $(*)$ for 'none'.

KL divergence is asymmetric: in general, $\text{KL}(P \| Q) \neq \text{KL}(Q \| P)$.
Common applications include variational autoencoders (VAE), knowledge distillation, and training language models.
Passing raw probabilities (rather than log-probabilities) as x is a common mistake; it produces incorrect and potentially negative values.
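The asymmetry and the raw-probability pitfall noted above can both be checked by hand. A minimal sketch in plain Python, using a hypothetical helper `kl_sum` that applies the log_target=False formula with 'sum' reduction (the distributions are made up for illustration):

```python
import math

def kl_sum(x, y):
    """Sum of y * (log y - x); x should hold log-probabilities."""
    return sum(t * (math.log(t) - xi) for xi, t in zip(x, y) if t > 0)

p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]
log_p = [math.log(v) for v in p]
log_q = [math.log(v) for v in q]

kl_pq = kl_sum(log_q, p)  # KL(P || Q): target P, prediction Q
kl_qp = kl_sum(log_p, q)  # KL(Q || P): roles swapped, different value

# Mistake: raw probabilities passed where log-probabilities are expected.
wrong = kl_sum(q, p)      # negative, which a true KL divergence never is
```

Here kl_pq and kl_qp are both positive but not equal, while wrong comes out negative because q was never moved into log-space.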

Examples

Comparing two discrete distributions ('batchmean' follows the mathematical KL convention):
>>> import lucid
>>> import lucid.nn as nn
>>> import lucid.nn.functional as F
>>> criterion = nn.KLDivLoss(reduction="batchmean")
>>> log_pred = F.log_softmax(lucid.tensor([[0.5, 1.0, 0.2]]), dim=1)
>>> target   = F.softmax(lucid.tensor([[0.3, 0.9, 0.4]]),   dim=1)
>>> loss = criterion(log_pred, target)

With a log-space target (knowledge distillation style):
>>> import lucid
>>> import lucid.nn as nn
>>> import lucid.nn.functional as F
>>> criterion = nn.KLDivLoss(reduction="batchmean", log_target=True)
>>> log_p = F.log_softmax(lucid.tensor([[1.0, 2.0, 0.5]]), dim=1)
>>> log_q = F.log_softmax(lucid.tensor([[0.8, 1.5, 0.9]]), dim=1)
>>> loss = criterion(log_p, log_q)

Used by 1

lucid.nn.modules

Constructors

__init__

→None

__init__(reduction: ReductionKL = 'mean', log_target: bool = False)

Initialise the KLDivLoss module. See the class docstring for parameter semantics.

Instance methods

extra_repr

→str

extra_repr()

Return a string representation of the layer's configuration.

forward

→Tensor

forward(x: Tensor, target: Tensor)

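Compute the KL divergence loss between the input x and the target.

Parameters

x: Tensor

Input log-probabilities, shape $(*)$.

target: Tensor

Target probabilities (or log-probabilities when log_target=True), shape $(*)$.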

Returns

Tensor

Scalar loss (or unreduced tensor depending on reduction).
