EfficientFormer¶
Transformer · Vision Transformer · Image Classification
- class lucid.models.EfficientFormer(depths: list[int], embed_dims: int | None = None, in_channels: int = 3, num_classes: int = 1000, global_pool: bool = True, downsamples: list[bool] | None = None, num_vit: int = 0, mlp_ratios: float = 4.0, pool_size: int = 3, layer_scale_init_value: float = 1e-05, act_layer: type[nn.Module] = nn.GELU, norm_layer: type[nn.Module] = nn.BatchNorm2d, norm_layer_cl: type[nn.Module] = nn.LayerNorm, drop_rate: float = 0.0, proj_drop_rate: float = 0.0, drop_path_rate: float = 0.0)¶
The EfficientFormer module implements a lightweight hybrid vision transformer architecture optimized for mobile and edge devices. It combines convolutional and transformer-based components in a streamlined hierarchical design to achieve high efficiency and competitive accuracy.

Class Signature¶
class EfficientFormer(nn.Module):
    def __init__(
        self,
        depths: list[int],
        embed_dims: int | None = None,
        in_channels: int = 3,
        num_classes: int = 1000,
        global_pool: bool = True,
        downsamples: list[bool] | None = None,
        num_vit: int = 0,
        mlp_ratios: float = 4.0,
        pool_size: int = 3,
        layer_scale_init_value: float = 1e-5,
        act_layer: type[nn.Module] = nn.GELU,
        norm_layer: type[nn.Module] = nn.BatchNorm2d,
        norm_layer_cl: type[nn.Module] = nn.LayerNorm,
        drop_rate: float = 0.0,
        proj_drop_rate: float = 0.0,
        drop_path_rate: float = 0.0,
    ) -> None
Parameters¶
depths (list[int]): Depth of each stage in the EfficientFormer hierarchy.
embed_dims (int | None, optional): Base embedding dimension. If None, it is derived from model size. Default is None.
in_channels (int, optional): Number of input image channels. Default is 3.
num_classes (int, optional): Number of classes for the classification head. Default is 1000.
global_pool (bool, optional): Whether to apply global average pooling before the classifier. Default is True.
downsamples (list[bool] | None, optional): Whether to downsample at each stage. If None, the model's default stage configuration is used.
num_vit (int, optional): Number of stages using transformer blocks. Remaining stages use convolutions. Default is 0.
mlp_ratios (float, optional): MLP expansion ratio in transformer blocks. Default is 4.0.
pool_size (int, optional): Kernel size for the pooling layer in convolution stages. Default is 3.
layer_scale_init_value (float, optional): Initial value for layer scale in transformer stages. Default is 1e-5.
act_layer (type[nn.Module], optional): Activation function used throughout the model. Default is nn.GELU.
norm_layer (type[nn.Module], optional): Normalization layer used in convolution stages. Default is nn.BatchNorm2d.
norm_layer_cl (type[nn.Module], optional): Normalization layer used in classification layers. Default is nn.LayerNorm.
drop_rate (float, optional): Dropout rate for embedding dropout. Default is 0.0.
proj_drop_rate (float, optional): Dropout rate for projection layers. Default is 0.0.
drop_path_rate (float, optional): Drop path rate for stochastic depth. Default is 0.0.
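As a quick illustration of how the stage-level parameters interact, the sketch below builds a four-stage model with explicit per-stage downsampling flags and an overridden activation layer. The specific depths, the downsampling pattern, and the availability of nn.ReLU are illustrative assumptions, not library defaults.
import lucid
import lucid.nn as nn
from lucid.models import EfficientFormer

# Four stages: one depth entry and one downsampling flag per stage
# (illustrative configuration, not the library default).
model = EfficientFormer(
    depths=[3, 2, 6, 4],
    downsamples=[False, True, True, True],
    num_vit=1,                   # one stage uses transformer blocks (see num_vit above)
    mlp_ratios=4.0,
    act_layer=nn.ReLU,           # assumed available; swap back to nn.GELU if not
    drop_path_rate=0.05,
)

x = lucid.randn(2, 3, 224, 224)
print(model(x).shape)  # expected: (2, 1000)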
Architecture¶
EfficientFormer combines the strengths of convolutional inductive bias and transformer flexibility in a stage-wise hybrid design:
Convolutional Stages:
- Early stages use lightweight depthwise-separable convolutions.
- Convolution + BatchNorm + Activation (CBR) blocks dominate these layers.
Transformer Stages:
- Later stages use efficient self-attention with MLPs and normalization.
- Simplified attention reduces complexity while maintaining accuracy.
MetaBlock:
- The MetaBlock is the core computational unit used in both convolutional and transformer stages (a sketch of the pattern follows this section).
- In convolutional mode, it uses depthwise-separable convolutions with residual connections.
- In transformer mode, it applies a simplified attention mechanism followed by a feed-forward MLP block.
- It supports layer scaling, drop path, and residual connections, enabling stable training at depth.
- MetaBlocks unify the design across stages, reducing architectural complexity while retaining flexibility.
Hierarchical Design:
- Embedding dimensions grow progressively through the stages.
- Downsampling after each stage is optional.
Classification Head:
- Global average pooling followed by a LayerNorm and a linear classifier.
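The MetaBlock pattern can be sketched as follows. This is a minimal NumPy illustration of one convolutional-mode block, not lucid's implementation: the token mixer here is a PoolFormer-style average pool minus identity, the MLP uses ReLU for brevity, and pre-normalization and drop path are omitted. All names and shapes are illustrative.
import numpy as np

def layer_scaled_residual(x, sublayer, scale):
    # Residual branch with per-channel layer scale: x + scale * f(x).
    return x + scale * sublayer(x)

def avg_pool_mixer(x, pool_size=3):
    # Pooling-style token mixer (average pool minus identity), standing in
    # for the convolutional-mode mixer in this sketch.
    pad = pool_size // 2
    padded = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode="edge")
    out = np.zeros_like(x)
    _, _, H, W = x.shape
    for i in range(H):
        for j in range(W):
            out[:, :, i, j] = padded[:, :, i:i + pool_size, j:j + pool_size].mean(axis=(2, 3))
    return out - x

def mlp(x, w1, w2):
    # Point-wise feed-forward block (ReLU used here instead of GELU for brevity).
    h = np.maximum(np.einsum("nchw,cd->ndhw", x, w1), 0.0)
    return np.einsum("ndhw,dc->nchw", h, w2)

# Toy forward pass through one convolutional-mode MetaBlock.
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8, 14, 14))
scale = np.full((1, 8, 1, 1), 1e-5)       # layer_scale_init_value
w1 = rng.standard_normal((8, 32)) * 0.02  # mlp_ratios = 4.0 -> hidden dim 32
w2 = rng.standard_normal((32, 8)) * 0.02

x = layer_scaled_residual(x, avg_pool_mixer, scale)
x = layer_scaled_residual(x, lambda t: mlp(t, w1, w2), scale)
print(x.shape)  # (1, 8, 14, 14)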
Examples¶
Basic Usage
import lucid
from lucid.models import EfficientFormer

# Create a default EfficientFormer model
model = EfficientFormer(depths=[2, 2, 6, 2], num_vit=2)

input_tensor = lucid.randn(1, 3, 224, 224)
output = model(input_tensor)
print(output.shape)  # Shape: (1, 1000)
Custom Configuration
model = EfficientFormer(
    depths=[3, 3, 9, 3],
    embed_dims=64,
    num_vit=3,
    num_classes=100,
    drop_rate=0.1,
    drop_path_rate=0.1,
)

input_tensor = lucid.randn(1, 3, 224, 224)
output = model(input_tensor)
print(output.shape)  # Shape: (1, 100)
Tip
Increase the number of transformer stages (num_vit) to enhance long-range feature modeling.
Warning
The depths list must match the number of model stages, and num_vit should not exceed that length.
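For instance, a small configuration guard like the following (an illustrative helper, not part of the lucid API) can catch such mismatches early:
def check_config(depths: list[int], downsamples: list[bool] | None = None, num_vit: int = 0) -> None:
    # Illustrative guard, not part of lucid: one depth entry per stage,
    # per-stage flags aligned with depths, and num_vit bounded by the stage count.
    num_stages = len(depths)
    if downsamples is not None and len(downsamples) != num_stages:
        raise ValueError("downsamples must provide one flag per stage")
    if not 0 <= num_vit <= num_stages:
        raise ValueError("num_vit must be between 0 and the number of stages")

check_config(depths=[2, 2, 6, 2], downsamples=[False, True, True, True], num_vit=2)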