class FusedLinear extends Module

FusedLinear(in_features: int, out_features: int, activation: str = 'relu', bias: bool = True, device: DeviceLike = None, dtype: DTypeLike = None)


Linear layer with a kernel-fused non-linear activation.

Computes

$$\mathbf{y} = \sigma\!\left(\mathbf{x}\mathbf{W}^{\top} + \mathbf{b}\right)$$

where $\sigma$ is one of the supported pointwise activations.
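As a point of reference, the formula can be written out in plain Python. This is a minimal sketch, not the library's kernel: `fused_linear_reference` is a hypothetical helper, it works on nested lists rather than tensors, and it assumes the weight is laid out as `(out_features, in_features)`, so that each row of the weight produces one output feature.

```python
def relu(z):
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)


def fused_linear_reference(x, weight, bias=None, activation=relu):
    """Compute activation(x @ weight.T + bias) for a single sample.

    x:      list of in_features floats
    weight: out_features rows, each holding in_features floats
    bias:   list of out_features floats, or None
    (Hypothetical reference helper; not part of the library.)
    """
    out = []
    for j, row in enumerate(weight):
        # Dot product of the input with row j of the weight matrix.
        z = sum(xi * wi for xi, wi in zip(x, row))
        if bias is not None:
            z += bias[j]
        # The activation is applied directly to the pre-activation value,
        # so no intermediate pre-activation list is ever stored.
        out.append(activation(z))
    return out


x = [1.0, -2.0]                            # in_features = 2
W = [[0.5, 0.25], [1.0, 1.0], [2.0, 0.0]]  # out_features = 3
b = [0.1, 0.0, -1.0]
y = fused_linear_reference(x, W, b)
```

The loop applies the activation as each output element is produced, which mirrors the idea of skipping the intermediate pre-activation tensor, although a real fused kernel would do this inside a single BLAS-backed pass.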

Inference mode dispatches to a single BLAS + Accelerate pass that avoids allocating the intermediate pre-activation tensor. Training mode falls back to unfused, differentiable ops so that the autograd engine can compute correct gradients through both the linear projection and the activation.

Parameters

in_features: int

Dimensionality of each input sample ($d_{\text{in}}$).

out_features: int

Dimensionality of each output sample ($d_{\text{out}}$).

activation: str = 'relu'

Fused activation function. Supported values:

'relu' (default): rectified linear unit, $\sigma(z) = \max(0, z)$. Fastest; preferred for intermediate hidden layers.
'gelu': Gaussian error linear unit (tanh approximation), $\sigma(z) = z \cdot \tfrac{1}{2} \left[1 + \tanh\!\left(\sqrt{\tfrac{2}{\pi}} \left(z + 0.044715 z^3\right)\right)\right]$. Preferred in Transformer MLP blocks.
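The two activation formulas can be checked numerically with the standard library alone. The sketch below is illustrative only: the function names and the `ACTIVATIONS` lookup table are hypothetical and are not taken from the library. It also includes the exact erf-based GELU for comparison with the tanh approximation.

```python
import math


def relu(z: float) -> float:
    """sigma(z) = max(0, z)."""
    return max(0.0, z)


def gelu_tanh(z: float) -> float:
    """Tanh approximation of GELU, term by term as in the formula above."""
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)
    return 0.5 * z * (1.0 + math.tanh(inner))


def gelu_erf(z: float) -> float:
    """Exact GELU, z * Phi(z), shown only as a comparison point."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical name-to-function table, keyed by the documented strings.
ACTIVATIONS = {"relu": relu, "gelu": gelu_tanh}

for z in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"z={z:+.1f}  relu={relu(z):.4f}  "
          f"gelu_tanh={gelu_tanh(z):.4f}  gelu_erf={gelu_erf(z):.4f}")
```

Unlike ReLU, GELU returns small negative values for moderately negative inputs, and the tanh form tracks the erf form closely across the range printed here.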

bias: bool = True

If True (default), a learnable bias is added before the activation. When bias=False, the fused kernel is unavailable and the layer falls back to standard unfused ops even during inference.

device: DeviceLike = None

Device for initial parameters.

dtype: DTypeLike = None

Data type for initial parameters.

Attributes

weight: Parameter

Weight matrix of shape (out_features, in_features). Initialized with Kaiming uniform (same scheme as Linear).

bias: Parameter or None

Bias vector of shape (out_features,). None when bias=False.

activation: str

Name of the fused activation ('relu' or 'gelu').

in_features: int

Stored input dimensionality.

out_features: int

Stored output dimensionality.

Notes

Input: $(\ast, d_{\text{in}})$.
Output: $(\ast, d_{\text{out}})$ after activation.
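The shape rule says that only the last dimension changes and all leading dimensions pass through untouched. A small sketch of that rule follows; `fused_linear_output_shape` is a hypothetical helper written for illustration, and the error it raises is an assumption rather than documented library behaviour.

```python
def fused_linear_output_shape(input_shape, in_features, out_features):
    """Map (*, in_features) to (*, out_features).

    Hypothetical helper: it only reasons about shapes and does no arithmetic.
    """
    if len(input_shape) == 0 or input_shape[-1] != in_features:
        # Assumed behaviour for a mismatched last dimension.
        raise ValueError(
            f"expected last dimension {in_features}, got shape {tuple(input_shape)}"
        )
    # Leading (batch-like) dimensions are preserved as-is.
    return tuple(input_shape[:-1]) + (out_features,)


batch_shape = fused_linear_output_shape((4, 64), 64, 256)
sequence_shape = fused_linear_output_shape((2, 16, 768), 768, 3072)
```

The two calls use the same sizes as the examples on this page, a batch of vectors and a `(batch, seq_len, d_model)` tensor.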

The kernel fusion benefit is most pronounced for inference-only workloads (e.g. model serving with lucid.no_grad()). During training, the unfused fallback ensures that every intermediate value needed for backpropagation is materialised correctly.

For bias=False the fused path is always skipped; prefer using bias=True to take advantage of fusion.
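Taken together, the notes describe two conditions for reaching the fused kernel. The sketch below encodes them as a predicate. It assumes that "inference" means gradient tracking is disabled, as under `lucid.no_grad()`; the page does not say whether the module's train/eval flag is also consulted. `uses_fused_path` is a hypothetical name, not a library function.

```python
def uses_fused_path(grad_enabled: bool, has_bias: bool) -> bool:
    """Return True when the fused kernel would be selected.

    Assumptions (not confirmed by the library source):
      * "inference" is modelled as grad_enabled == False
      * bias=False always forces the unfused fallback
    """
    return (not grad_enabled) and has_bias


# Enumerate all four combinations of the two conditions.
dispatch_table = {
    (grad_enabled, has_bias): uses_fused_path(grad_enabled, has_bias)
    for grad_enabled in (False, True)
    for has_bias in (False, True)
}
```

Under these assumptions, only one of the four combinations (gradients disabled, bias present) takes the fused path, which is why constructing the layer with bias=True matters for serving workloads.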

Examples

Inference with ReLU activation:

>>> import lucid
>>> import lucid.nn as nn
>>> m = nn.FusedLinear(64, 256, activation='relu')
>>> x = lucid.randn(4, 64)
>>> with lucid.no_grad():
...     y = m(x)   # single-pass fused kernel on CPU
>>> y.shape
(4, 256)

GELU activation for a Transformer MLP block:

>>> mlp = nn.FusedLinear(768, 3072, activation='gelu')
>>> x = lucid.randn(2, 16, 768)   # (batch, seq_len, d_model)
>>> with lucid.no_grad():
...     out = mlp(x)
>>> out.shape
(2, 16, 3072)


Constructors

__init__ → None

__init__(in_features: int, out_features: int, activation: str = 'relu', bias: bool = True, device: DeviceLike = None, dtype: DTypeLike = None)


Initialise the FusedLinear module. See the class docstring for parameter semantics.

Instance methods

extra_repr → str

extra_repr()


Return a string representation of the layer's configuration.

forward → Tensor

forward(x: Tensor)


Apply the linear transformation, followed by the configured activation, to the input tensor.

Parameters

x: Tensor

Input tensor of shape $(*, \text{in\_features})$.

Returns

Tensor

Output tensor of shape $(*, \text{out\_features})$.
