class FusedLinear extends Module
FusedLinear(in_features: int, out_features: int, activation: str = 'relu', bias: bool = True, device: DeviceLike = None, dtype: DTypeLike = None)

Linear layer with a kernel-fused non-linear activation.

Computes

$$\mathbf{y} = \sigma\!\left(\mathbf{x}\mathbf{W}^{\top} + \mathbf{b}\right)$$

where $\sigma$ is one of the supported pointwise activations.

Inference mode dispatches to a single BLAS + Accelerate pass that avoids allocating the intermediate pre-activation tensor. Training mode falls back to unfused, differentiable ops so that the autograd engine can compute correct gradients through both the linear projection and the activation.
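The difference between the two execution paths can be illustrated in plain Python. This is only a sketch: `linear_unfused` and `linear_fused` are hypothetical names, nested lists stand in for tensors, and the real fused path is a native kernel, not a Python loop. The point is that both paths produce the same values, while only the unfused one stores the pre-activation.

```python
def relu(z):
    # Pointwise ReLU: sigma(z) = max(0, z).
    return max(0.0, z)

def linear_unfused(x, W, b, act):
    # Step 1: materialise the pre-activation z = x W^T + b.
    # W has shape (out_features, in_features), so each row of W
    # produces one output feature.
    z = [[sum(xi * wi for xi, wi in zip(row, w_row)) + b_j
          for w_row, b_j in zip(W, b)]
         for row in x]
    # Step 2: apply the activation in a second pass over z.
    return [[act(v) for v in row] for row in z]

def linear_fused(x, W, b, act):
    # Single pass: the activation is applied as each output element
    # is produced, so no intermediate z is ever stored.
    return [[act(sum(xi * wi for xi, wi in zip(row, w_row)) + b_j)
             for w_row, b_j in zip(W, b)]
            for row in x]

x = [[1.0, -2.0]]                           # shape (1, 2)
W = [[1.0, 1.0], [0.5, -1.0], [2.0, 0.0]]   # shape (3, 2)
b = [0.0, 0.0, -3.0]                        # shape (3,)
y = linear_fused(x, W, b, relu)
```

Keeping `z` around is what makes the unfused form suitable for backpropagation, which matches the training-mode fallback described above.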

Parameters

in_features: int
Dimensionality of each input sample ($d_{\text{in}}$).
out_features: int
Dimensionality of each output sample ($d_{\text{out}}$).
activation: str = 'relu'
Fused activation function. Supported values:
  • 'relu' (default): rectified linear unit, $\sigma(z) = \max(0, z)$. Fastest; preferred for intermediate hidden layers.
  • 'gelu': Gaussian error linear unit (tanh approximation), $\sigma(z) = z \cdot \tfrac{1}{2} \left[1 + \tanh\!\left(\sqrt{\tfrac{2}{\pi}} \left(z + 0.044715 z^3\right)\right)\right]$. Preferred in Transformer MLP blocks.
bias: bool = True
If True (default), a learnable bias is added before the activation. When bias=False the fused kernel is unavailable and the layer falls back to standard unfused ops even during inference.
device: DeviceLike = None
Device for initial parameters.
dtype: DTypeLike = None
Dtype for initial parameters.
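The two activation formulas listed under `activation` can be written out as scalar functions. This is a reference sketch in pure Python for checking values by hand; the function names are illustrative and are not part of the library.

```python
import math

def relu(z: float) -> float:
    # sigma(z) = max(0, z)
    return max(0.0, z)

def gelu_tanh(z: float) -> float:
    # Tanh approximation of GELU:
    # sigma(z) = z * 1/2 * [1 + tanh(sqrt(2/pi) * (z + 0.044715 * z^3))]
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)
    return 0.5 * z * (1.0 + math.tanh(inner))

samples = [-2.0, 0.0, 1.0]
relu_values = [relu(z) for z in samples]
gelu_values = [gelu_tanh(z) for z in samples]
```

Unlike ReLU, the GELU approximation is smooth and slightly negative for small negative inputs, which is why the two options are not interchangeable after training.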

Attributes

weight: Parameter
Weight matrix of shape (out_features, in_features). Initialised with Kaiming uniform (same scheme as Linear).
bias: Parameter or None
Bias vector of shape (out_features,). None when bias=False.
activation: str
Name of the fused activation ('relu' or 'gelu').
in_features: int
Stored input dimensionality.
out_features: int
Stored output dimensionality.
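As a rough illustration of the Kaiming uniform scheme mentioned for `weight`, the sketch below samples a weight matrix in pure Python. It assumes the convention used by PyTorch's `nn.Linear` (negative-slope parameter a = sqrt(5), which gives a bound of 1/sqrt(in_features)); whether this library's Linear uses the same constant is an assumption, and `kaiming_uniform_bound` and `init_weight` are hypothetical names.

```python
import math
import random

def kaiming_uniform_bound(fan_in: int, a: float = math.sqrt(5.0)) -> float:
    # gain for leaky-ReLU-style nonlinearity with negative slope a
    gain = math.sqrt(2.0 / (1.0 + a * a))
    # uniform bound so that the variance equals gain^2 / fan_in
    return gain * math.sqrt(3.0 / fan_in)

def init_weight(out_features: int, in_features: int, seed: int = 0):
    # Returns a nested list of shape (out_features, in_features),
    # sampled from U(-bound, bound). The seed is only for reproducibility.
    rng = random.Random(seed)
    bound = kaiming_uniform_bound(in_features)
    return [[rng.uniform(-bound, bound) for _ in range(in_features)]
            for _ in range(out_features)]

weight = init_weight(out_features=3, in_features=64)
```

With a = sqrt(5) the gain and the sqrt(3) factor cancel, so the bound depends only on `in_features`.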

Notes

  • Input: $(\ast, d_{\text{in}})$.
  • Output: $(\ast, d_{\text{out}})$ after activation.
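The shape rule can be sketched as a small helper, assuming $\ast$ stands for any number of leading dimensions that pass through unchanged (as the examples in this section suggest). `output_shape` is a hypothetical function for illustration only.

```python
def output_shape(input_shape, in_features, out_features):
    # Only the last dimension is transformed; it must equal in_features.
    if len(input_shape) == 0 or input_shape[-1] != in_features:
        raise ValueError(
            f"expected last dimension {in_features}, got shape {tuple(input_shape)}"
        )
    # Leading dimensions are kept; the last becomes out_features.
    return tuple(input_shape[:-1]) + (out_features,)

batch_shape = output_shape((4, 64), 64, 256)
sequence_shape = output_shape((2, 16, 768), 768, 3072)
```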

The kernel fusion benefit is most pronounced for inference-only workloads (e.g. model serving with lucid.no_grad()). During training, the unfused fallback ensures that every intermediate value needed for backpropagation is materialised correctly.

For bias=False the fused path is always skipped; prefer using bias=True to take advantage of fusion.
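The path-selection rules stated in this section can be summarised as a small predicate. `uses_fused_path` is a hypothetical helper, not part of the library; how inference versus training is actually detected (a module flag, a `no_grad` context, or both) is not specified here, so it is modelled as a single boolean.

```python
def uses_fused_path(training: bool, has_bias: bool) -> bool:
    # Fused kernel only when running inference AND a bias is present.
    # Training always takes the unfused, differentiable path;
    # bias=False always takes the unfused path, even at inference.
    return (not training) and has_bias

# Enumerate all four combinations of (training, has_bias).
dispatch_table = {
    (training, has_bias): uses_fused_path(training, has_bias)
    for training in (False, True)
    for has_bias in (False, True)
}
```

Only one of the four combinations reaches the fused kernel, which is why bias=True is the recommended setting for serving.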

Examples

Inference with ReLU activation:
>>> import lucid
>>> import lucid.nn as nn
>>> m = nn.FusedLinear(64, 256, activation='relu')
>>> x = lucid.randn(4, 64)
>>> with lucid.no_grad():
...     y = m(x)   # single-pass fused kernel on CPU
>>> y.shape
(4, 256)

GELU activation for a Transformer MLP block:
>>> mlp = nn.FusedLinear(768, 3072, activation='gelu')
>>> x = lucid.randn(2, 16, 768)   # (batch, seq_len, d_model)
>>> with lucid.no_grad():
...     out = mlp(x)
>>> out.shape
(2, 16, 3072)

Methods (3)

dunder

__init__

__init__(in_features: int, out_features: int, activation: str = 'relu', bias: bool = True, device: DeviceLike = None, dtype: DTypeLike = None) -> None

Initialise the FusedLinear module. See the class docstring for parameter semantics.

fn

forward

forward(x: Tensor) -> Tensor

Apply the linear transformation followed by the configured activation to the input tensor.

Parameters

x: Tensor
Input tensor of shape $(\ast, \text{in\_features})$.

Returns

Tensor

Output tensor of shape $(\ast, \text{out\_features})$.

fn

extra_repr

extra_repr() -> str

Return a string representation of the layer's configuration.