class Linear

extends Module
Linear(in_features: int, out_features: int, bias: bool = True, device: DeviceLike = None, dtype: DTypeLike = None)

Apply a learnable affine transformation to incoming data.

Computes the linear map

$$\mathbf{y} = \mathbf{x} \mathbf{W}^{\top} + \mathbf{b}$$

where $\mathbf{W} \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ is the weight matrix and $\mathbf{b} \in \mathbb{R}^{d_{\text{out}}}$ is the optional bias vector.
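As an illustration of the formula above, here is a minimal pure-Python sketch of the map for a single sample. It is not how Lucid implements the layer; the helper name `linear` and the list-based data layout are assumptions made only for this example.

```python
def linear(x, weight, bias=None):
    """Compute y = x W^T + b for one sample x (a list of d_in floats).

    `weight` is a list of d_out rows, each of length d_in, mirroring the
    (out_features, in_features) layout of the weight matrix.
    """
    # Each output component is the dot product of one weight row with x.
    y = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in weight]
    if bias is not None:
        y = [y_i + b_i for y_i, b_i in zip(y, bias)]
    return y


# d_in = 3, d_out = 2 (values chosen by hand for illustration)
W = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]
b = [0.1, -0.1]
x = [2.0, 4.0, 6.0]

y = linear(x, W, b)          # approximately [-3.9, 5.9]
y_no_bias = linear(x, W)     # the bias term is simply skipped
```

Because each row of the weight matrix produces one output component, multiplying by the transpose is what turns a length-`in_features` input into a length-`out_features` output.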

Parameters

in_features: int
Dimensionality of each input sample ($d_{\text{in}}$).
out_features: int
Dimensionality of each output sample ($d_{\text{out}}$).
bias: bool = True
If True (default) a learnable bias $\mathbf{b}$ is added to the output. Set to False when a subsequent normalization layer already absorbs the bias (e.g. BatchNorm1d).
device: DeviceLike = None
Device on which the initial parameters are allocated ('cpu' or 'metal'). Defaults to the global default device.
dtype: DTypeLike = None
Floating-point dtype for the initial parameters. Defaults to the global default dtype (float32).

Attributes

weight: Parameter
Learnable weight matrix of shape (out_features, in_features). Initialized with Kaiming uniform sampling: $\mathbf{W}_{ij} \sim \mathcal{U}\!\left( -\sqrt{\tfrac{6}{(1 + a^2)\,d_{\text{in}}}},\; \sqrt{\tfrac{6}{(1 + a^2)\,d_{\text{in}}}} \right)$, where $a = \sqrt{5}$ is the default negative-slope parameter. With this value of $a$ the bound simplifies to $1/\sqrt{d_{\text{in}}}$, so the weight scale shrinks as the fan-in grows. This keeps the scale of activations and gradients under control at initialization, which is critical for training stability in deep networks.
bias: Parameter or None
Learnable bias vector of shape (out_features,). Initialized with uniform sampling over $\left[-\tfrac{1}{\sqrt{d_{\text{in}}}},\; \tfrac{1}{\sqrt{d_{\text{in}}}}\right]$. None when bias=False.
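The two initialization rules can be sketched with the standard library alone. This is an illustration of the documented formulas, not Lucid's actual code; the names `kaiming_uniform_bound` and `init_linear`, the nested-list layout, and the `seed` argument are hypothetical.

```python
import math
import random


def kaiming_uniform_bound(fan_in, a=math.sqrt(5)):
    """Bound of the uniform distribution: sqrt(6 / ((1 + a^2) * fan_in))."""
    return math.sqrt(6.0 / ((1.0 + a * a) * fan_in))


def init_linear(in_features, out_features, bias=True, seed=0):
    """Sample a weight matrix and an optional bias with the documented bounds."""
    rng = random.Random(seed)  # hypothetical seeding, for reproducibility only
    w_bound = kaiming_uniform_bound(in_features)
    # Shape (out_features, in_features), stored as a list of rows.
    weight = [[rng.uniform(-w_bound, w_bound) for _ in range(in_features)]
              for _ in range(out_features)]
    b = None
    if bias:
        b_bound = 1.0 / math.sqrt(in_features)
        b = [rng.uniform(-b_bound, b_bound) for _ in range(out_features)]
    return weight, b


weight, b = init_linear(in_features=20, out_features=10)

# With a = sqrt(5), (1 + a^2) = 6, so the weight bound equals 1 / sqrt(fan_in),
# up to floating-point rounding.
w_bound = kaiming_uniform_bound(20)
```

Under these formulas the weight and bias are drawn from the same interval, since both bounds reduce to $1/\sqrt{d_{\text{in}}}$ when $a = \sqrt{5}$.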

Notes

  • Input: $(\ast, d_{\text{in}})$, i.e. any number of leading batch dimensions followed by in_features.
  • Output: $(\ast, d_{\text{out}})$, i.e. the same leading dimensions, with the last axis replaced by out_features.
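The shape rule in the two bullets above can be written as a small helper. This is a sketch for reasoning about shapes only; `linear_output_shape` is a hypothetical function, not part of Lucid, and the error it raises is an assumption about how a mismatch might be reported.

```python
def linear_output_shape(input_shape, in_features, out_features):
    """Return the output shape for an input of shape (*, in_features)."""
    if len(input_shape) == 0 or input_shape[-1] != in_features:
        raise ValueError(
            f"expected last axis of size {in_features}, "
            f"got shape {tuple(input_shape)}"
        )
    # Leading dimensions pass through unchanged; only the last axis changes.
    return tuple(input_shape[:-1]) + (out_features,)


shape_2d = linear_output_shape((4, 20), 20, 10)          # (4, 10)
shape_3d = linear_output_shape((2, 32, 512), 512, 256)   # (2, 32, 256)
```

Only the last axis is checked against in_features, which is why the same layer can be applied to 2-D batches and to (batch, seq_len, d_model) inputs alike.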

Linear is the most common building block in feed-forward sub-layers (e.g. the MLP inside a Transformer block uses two Linear layers with a non-linearity in between). When many layers are composed in sequence, the Kaiming initialization helps keep both the forward activations and the backward gradients from exploding or vanishing at the start of training.
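The two-layer feed-forward pattern mentioned above can be sketched in plain Python. This is illustrative only: the choice of ReLU as the non-linearity, the helper names, and the hand-picked weights are all assumptions, and a real model would use Lucid modules rather than nested lists.

```python
def linear(x, weight, bias):
    """y = x W^T + b for a single sample, with weight stored as rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def relu(x):
    """Element-wise max(0, v); stands in for 'a non-linearity'."""
    return [max(0.0, v) for v in x]


def mlp(x, w1, b1, w2, b2):
    """Linear -> ReLU -> Linear."""
    return linear(relu(linear(x, w1, b1)), w2, b2)


# d_model = 2, hidden width = 3 (hand-picked values)
w1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # shape (3, 2)
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 1.0, 1.0], [1.0, -1.0, 0.0]]      # shape (2, 3)
b2 = [0.0, 0.0]

hidden = relu(linear([1.0, -2.0], w1, b1))    # [1.0, 0.0, 1.0]
out = mlp([1.0, -2.0], w1, b1, w2, b2)        # [2.0, 1.0]
```

The first layer expands the input to the hidden width and the second projects it back, so the out_features of one layer must match the in_features of the next.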

Examples

Basic usage with a 2-D input:
>>> import lucid
>>> import lucid.nn as nn
>>> m = nn.Linear(20, 10)
>>> x = lucid.randn(4, 20)   # batch of 4, 20 features each
>>> y = m(x)
>>> y.shape
(4, 10)
Higher-dimensional inputs (batch + sequence):
>>> m = nn.Linear(512, 256)
>>> x = lucid.randn(2, 32, 512)   # (batch, seq_len, d_model)
>>> m(x).shape
(2, 32, 256)
Disable bias for use before a normalization layer:
>>> m_no_bias = nn.Linear(128, 64, bias=False)
>>> m_no_bias.bias is None
True
>>> m_no_bias(lucid.randn(8, 128)).shape == (8, 64)
True

Methods (4)

dunder

__init__

None
__init__(in_features: int, out_features: int, bias: bool = True, device: DeviceLike = None, dtype: DTypeLike = None)

Initialize the Linear module. See the class docstring for parameter semantics.

fn

reset_parameters

None
reset_parameters()

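Reinitialize weight with Kaiming uniform sampling and, when present, bias with uniform sampling over $\left[-\tfrac{1}{\sqrt{d_{\text{in}}}},\; \tfrac{1}{\sqrt{d_{\text{in}}}}\right]$, where $d_{\text{in}}$ is the fan-in (in_features).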

fn

forward

Tensor
forward(x: Tensor)

Apply the linear transformation to the input tensor.

Parameters

x: Tensor
Input tensor of shape $(\ast, \text{in\_features})$.

Returns

Tensor

Output tensor of shape $(\ast, \text{out\_features})$.

fn

extra_repr

str
extra_repr()

Return a string representation of the layer's configuration.