gelu

→Tensor

gelu(x: Tensor, approximate: GeluApproximate = 'none')

Gaussian Error Linear Unit activation.

Smooth, non-monotonic activation that has largely replaced ReLU in transformer architectures. It approximates ReLU, $x \cdot \mathbb{1}_{x>0}$, but is differentiable everywhere and passes a small negative signal for negative inputs, which can improve gradient flow in deep networks.

Parameters

x: Tensor

Input tensor of any shape; activation is element-wise.

approximate: str = 'none'

Either "none" (default; exact formula via erf) or "tanh" (faster tanh-based approximation from Hendrycks & Gimpel (2016), as used in BERT).

Returns

Tensor

Activated tensor with the same shape as x.

Notes

Exact form:

$$\text{GELU}(x) = x \, \Phi(x) = \frac{x}{2}\left[1 + \text{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right]$$

where $\Phi$ is the standard normal CDF. The "tanh" approximation is

$$\text{GELU}(x) \approx \frac{x}{2}\left[1 + \tanh\!\left(\sqrt{\tfrac{2}{\pi}}\,(x + 0.044715\, x^3)\right)\right].$$
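As a rough cross-check of the two formulas above, the sketch below evaluates both on plain Python floats using only the standard library. The helper names `gelu_exact` and `gelu_tanh` are hypothetical and are not part of lucid; this is a scalar reference under those assumptions, not the library's kernel.

```python
import math

def gelu_exact(x: float) -> float:
    # Exact form: x * Phi(x), with Phi written via erf.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # Tanh approximation with the 0.044715 cubic coefficient.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

xs = [-1.0, 0.0, 1.0, 2.0]
exact = [gelu_exact(v) for v in xs]
approx = [gelu_tanh(v) for v in xs]

# Largest absolute disagreement between the two forms on these inputs.
max_gap = max(abs(e - a) for e, a in zip(exact, approx))
```

On these four inputs the two forms agree to within about 1e-3, so the choice of `approximate` mostly matters for speed and for bit-exact reproducibility against a reference model.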

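Unlike ReLU, whose gradient is exactly zero for every negative input, GELU has a non-zero gradient almost everywhere: it vanishes only at the function's minimum near $x \approx -0.75$. This can help when training very deep transformer stacks. Introduced by Hendrycks & Gimpel (2016) and adopted widely after BERT.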
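To make the gradient comparison concrete, differentiating the exact form gives $\text{GELU}'(x) = \Phi(x) + x\,\varphi(x)$, where $\varphi$ is the standard normal density. The sketch below evaluates that expression next to ReLU's gradient at a few points. `gelu_grad` and `relu_grad` are hypothetical helper names, written with the standard library only and independent of lucid's autograd.

```python
import math

def gelu_grad(x: float) -> float:
    # d/dx [x * Phi(x)] = Phi(x) + x * phi(x)
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf

def relu_grad(x: float) -> float:
    # ReLU passes gradient only for positive inputs.
    return 1.0 if x > 0 else 0.0

for v in (-3.0, -1.0, 1.0):
    print(v, gelu_grad(v), relu_grad(v))
```

At the negative inputs ReLU's gradient is exactly zero, while GELU's is small and negative, so some signal still reaches earlier layers.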

Examples

>>> import lucid
>>> from lucid.nn.functional import gelu
>>> x = lucid.tensor([-1.0, 0.0, 1.0, 2.0])
>>> gelu(x)
Tensor([-0.1587,  0.0000,  0.8413,  1.9545])
