GELU
Module
GELU(approximate: str = 'none')
Gaussian Error Linear Unit activation function.
Applies element-wise:
$$\text{GELU}(x) = x \cdot \Phi(x)$$
where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution. Intuitively, GELU weights each input by the probability that a standard Gaussian random variable is smaller than it: inputs far into the positive tail pass through nearly unchanged, while those deep in the negative tail are suppressed.
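The exact form can be checked by hand using the identity $\Phi(x) = \tfrac{1}{2}\left(1 + \operatorname{erf}(x/\sqrt{2})\right)$. The following is a minimal pure-Python sketch of that formula, not the library's implementation; the function name `gelu_exact` is illustrative only.

```python
import math


def gelu_exact(x: float) -> float:
    """Exact GELU for a scalar: x * Phi(x), Phi being the standard normal CDF."""
    # Phi(x) expressed through the error function.
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi


# Evaluate on a few sample points, rounded to four decimals.
samples = (-2.0, -1.0, 0.0, 1.0, 2.0)
print([round(gelu_exact(v), 4) for v in samples])
```

Large positive inputs come back almost unchanged because $\Phi(x)$ approaches 1, while large negative inputs are scaled towards zero because $\Phi(x)$ approaches 0.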
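When approximate="tanh", the following closed-form approximation is used instead:
$$\text{GELU}(x) \approx 0.5\,x\left(1 + \tanh\!\left(\sqrt{2/\pi}\,\left(x + 0.044715\,x^{3}\right)\right)\right)$$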
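To get a feel for how close the tanh form is to the exact one, the two can be compared on a grid of inputs. This is a rough pure-Python sketch under the standard GELU definitions; `gelu_exact`, `gelu_tanh`, and the grid range are illustrative choices, not part of the library.

```python
import math


def gelu_exact(x: float) -> float:
    """Exact GELU via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Largest absolute gap between the two forms on [-6, 6] in steps of 0.01.
grid = [i / 100.0 for i in range(-600, 601)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)
print(f"max |exact - tanh| on [-6, 6]: {max_gap:.2e}")
```

The gap is small relative to typical activation magnitudes, which is why the approximation is usually an acceptable trade for speed.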
Parameters
approximate : str = 'none'
"none" uses the exact erf-based formula; "tanh" uses the faster tanh approximation. Default: "none".
Notes
- Input: $(*)$, any shape.
- Output: $(*)$, same shape as input.
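GELU is the activation used in many transformer architectures (for example BERT and GPT). It is smooth everywhere and, unlike ReLU, keeps a non-zero gradient for negative inputs (the gradient vanishes only at a single point, the function's minimum), which is commonly credited with more stable training in attention-based models.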
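The gradient behaviour can be made concrete with the analytic derivative of the exact form, $\text{GELU}'(x) = \Phi(x) + x\,\varphi(x)$, where $\varphi$ is the standard normal density. The sketch below is plain Python for illustration only (it does not use the library's autograd, and the helper names are hypothetical); it contrasts this derivative with ReLU's at a few points.

```python
import math


def gelu_grad(x: float) -> float:
    """Derivative of exact GELU: Phi(x) + x * pdf(x)."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf


def relu_grad(x: float) -> float:
    """Derivative of ReLU (taken as 0 at x == 0)."""
    return 1.0 if x > 0 else 0.0


# ReLU's gradient is exactly zero for negative inputs; GELU's is not.
for x in (-3.0, -1.0, -0.5, 0.5):
    print(f"x={x:+.1f}  gelu'={gelu_grad(x):+.4f}  relu'={relu_grad(x):.1f}")
```

For negative inputs the GELU derivative is small but non-zero, so those units still receive a learning signal, whereas ReLU passes none.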
Examples
>>> import lucid
>>> import lucid.nn as nn
>>> m = nn.GELU()
>>> x = lucid.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])
>>> m(x)
tensor([-0.0455, -0.1587, 0., 0.8413, 1.9545])
>>> # Fast tanh approximation — nearly identical for most inputs
>>> m_approx = nn.GELU(approximate="tanh")
>>> x = lucid.randn(4, 512)
>>> out = m_approx(x)
>>> out.shape
(4, 512)
Methods (3)
__init__
__init__(approximate: str = 'none') → None
Initialise the GELU module. See the class docstring for parameter semantics.
forward
forward(x: Tensor) → Tensor
Apply the activation function element-wise.
Parameters
input : Tensor
Returns
Tensor
Output tensor of the same shape as input.
extra_repr
extra_repr() → str
Return a string representation of the layer's configuration.