EfficientFormer

Transformer Vision Transformer Image Classification

class lucid.models.EfficientFormer(depths: list[int], embed_dims: int | None = None, in_channels: int = 3, num_classes: int = 1000, global_pool: bool = True, downsamples: list[bool] | None = None, num_vit: int = 0, mlp_ratios: float = 4.0, pool_size: int = 3, layer_scale_init_value: float = 1e-5, act_layer: type[nn.Module] = nn.GELU, norm_layer: type[nn.Module] = nn.BatchNorm2d, norm_layer_cl: type[nn.Module] = nn.LayerNorm, drop_rate: float = 0.0, proj_drop_rate: float = 0.0, drop_path_rate: float = 0.0)

The EfficientFormer module implements a lightweight hybrid vision transformer architecture optimized for mobile and edge devices. It combines convolutional and transformer-based components in a streamlined hierarchical design to achieve high efficiency and competitive accuracy.

EfficientFormer architecture

Class Signature

class EfficientFormer(nn.Module):
    def __init__(
        self,
        depths: list[int],
        embed_dims: int | None = None,
        in_channels: int = 3,
        num_classes: int = 1000,
        global_pool: bool = True,
        downsamples: list[bool] | None = None,
        num_vit: int = 0,
        mlp_ratios: float = 4.0,
        pool_size: int = 3,
        layer_scale_init_value: float = 1e-5,
        act_layer: type[nn.Module] = nn.GELU,
        norm_layer: type[nn.Module] = nn.BatchNorm2d,
        norm_layer_cl: type[nn.Module] = nn.LayerNorm,
        drop_rate: float = 0.0,
        proj_drop_rate: float = 0.0,
        drop_path_rate: float = 0.0,
    ) -> None

Parameters

  • depths (list[int]): Depth of each stage in the EfficientFormer hierarchy.

  • embed_dims (int | None, optional): Base embedding dimension. If None, it is derived from model size. Default is None.

  • in_channels (int, optional): Number of input image channels. Default is 3.

  • num_classes (int, optional): Number of classes for the classification head. Default is 1000.

  • global_pool (bool, optional): Whether to apply global average pooling before the classifier. Default is True.

  • downsamples (list[bool] | None, optional): Whether to downsample at each stage. If None, the standard per-stage configuration is used.

  • num_vit (int, optional): Number of stages using transformer blocks. Remaining stages use convolutions. Default is 0.

  • mlp_ratios (float, optional): MLP expansion ratio in transformer blocks. Default is 4.0.

  • pool_size (int, optional): Kernel size for the pooling layer in convolution stages. Default is 3.

  • layer_scale_init_value (float, optional): Initial value for layer scale in transformer stages. Default is 1e-5.

  • act_layer (type[nn.Module], optional): Activation function used throughout the model. Default is nn.GELU.

  • norm_layer (type[nn.Module], optional): Normalization layer used in convolution stages. Default is nn.BatchNorm2d.

  • norm_layer_cl (type[nn.Module], optional): Normalization layer used in classification layers. Default is nn.LayerNorm.

  • drop_rate (float, optional): Dropout rate applied to the embeddings. Default is 0.0.

  • proj_drop_rate (float, optional): Dropout rate for projection layers. Default is 0.0.

  • drop_path_rate (float, optional): Drop path rate for stochastic depth. Default is 0.0.
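
The stage-level options above interact at construction time. The sketch below uses hypothetical values (not an official preset) and assumes the import path shown in the examples further down plus a PyTorch-style lucid.nn alias:

import lucid
import lucid.nn as nn
from lucid.models.transformer import EfficientFormer

# Four stages with downsampling between them and attention only in the
# final stage. These values are illustrative, not an official preset.
model = EfficientFormer(
    depths=[3, 2, 6, 4],                   # one entry per stage
    downsamples=[True, True, True, True],
    num_vit=1,                             # transformer blocks in the last stage only
    mlp_ratios=4.0,
    pool_size=3,
    layer_scale_init_value=1e-5,
    act_layer=nn.GELU,
    norm_layer=nn.BatchNorm2d,
    drop_path_rate=0.05,
)

x = lucid.randn(2, 3, 224, 224)
print(model(x).shape)  # Expected: (2, 1000)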

Architecture

EfficientFormer combines the strengths of convolutional inductive bias and transformer flexibility in a stage-wise hybrid design:

  1. Convolutional Stages:

    • Early stages use lightweight depthwise-separable convolutions.

    • Convolution + BatchNorm + Activation blocks dominate these layers.

  2. Transformer Stages:

    • Later stages use efficient self-attention with MLPs and normalization.

    • A simplified attention formulation reduces complexity while maintaining accuracy.

  3. MetaBlock:

    • The MetaBlock is the core computational unit used in both convolutional and transformer stages.

    • In convolutional mode, MetaBlock uses depthwise separable convolutions with residual connections.

    • In transformer mode, it applies a simplified attention mechanism followed by a feed-forward MLP block.

    • It supports layer scaling, drop path, and residual connections, enabling stable and deep training.

    • MetaBlocks unify the design across different stages, reducing architectural complexity while retaining flexibility; a minimal sketch follows this list.

  4. Hierarchical Design:

    • Embedding dimensions grow progressively through stages.

    • Optional downsampling after each stage.

  5. Classification Head:

    • Global average pooling followed by a LayerNorm and linear classifier.
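
To make the MetaBlock idea concrete, here is a minimal conv-mode sketch written against a PyTorch-style API. It assumes lucid exposes nn.Conv2d, nn.AvgPool2d, nn.Sequential, nn.Parameter, and lucid.ones with the familiar signatures; the class name PoolMixerBlock and all values are illustrative, not lucid's internal implementation.

import lucid
import lucid.nn as nn

class PoolMixerBlock(nn.Module):
    """Illustrative conv-mode MetaBlock: a pooling token mixer plus a pointwise
    MLP, each wrapped in a layer-scaled residual connection."""

    def __init__(
        self, dim: int, pool_size: int = 3, mlp_ratio: float = 4.0, ls_init: float = 1e-5
    ) -> None:
        super().__init__()
        # Token mixer: average pooling as a cheap substitute for attention.
        self.token_mixer = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2)
        hidden = int(dim * mlp_ratio)
        # Feed-forward: two 1x1 convolutions with BatchNorm and GELU.
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, hidden, kernel_size=1),
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            nn.Conv2d(hidden, dim, kernel_size=1),
            nn.BatchNorm2d(dim),
        )
        # Per-channel layer scale, initialized small for stable deep training.
        self.ls1 = nn.Parameter(ls_init * lucid.ones(dim, 1, 1))
        self.ls2 = nn.Parameter(ls_init * lucid.ones(dim, 1, 1))

    def forward(self, x):
        x = x + self.ls1 * (self.token_mixer(x) - x)  # residual token mixing
        x = x + self.ls2 * self.mlp(x)                # residual MLP
        return x

In the full model, MetaBlocks additionally apply drop path to each residual branch, and transformer-mode blocks replace the pooling mixer with simplified self-attention.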

Examples

Basic Usage

import lucid
from lucid.models.transformer import EfficientFormer

# Create a default EfficientFormer model
model = EfficientFormer(depths=[2, 2, 6, 2], num_vit=2)

input_tensor = lucid.randn(1, 3, 224, 224)
output = model(input_tensor)

print(output.shape)  # Shape: (1, 1000)

Custom Configuration

model = EfficientFormer(
    depths=[3, 3, 9, 3],
    embed_dims=64,
    num_vit=3,
    num_classes=100,
    drop_rate=0.1,
    drop_path_rate=0.1
)

input_tensor = lucid.randn(1, 3, 224, 224)
output = model(input_tensor)
print(output.shape)  # Shape: (1, 100)

Tip

Increase the number of transformer stages (num_vit) to enhance long-range feature modeling.
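
For instance, the two configurations below (illustrative values, reusing the import from the examples above) differ only in how many of the final stages use attention:

# Convolution-only vs. attention in the last two stages.
conv_only = EfficientFormer(depths=[2, 2, 6, 2], num_vit=0)
hybrid = EfficientFormer(depths=[2, 2, 6, 2], num_vit=2)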

Warning

The depths list must match the number of model stages, and num_vit should not exceed that length.
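
A quick sanity check along these lines (plain Python, not part of lucid) can catch a misconfiguration before the model is constructed:

depths = [2, 2, 6, 2]
num_vit = 2

# Every stage needs a depth entry, and num_vit cannot exceed the stage count.
assert num_vit <= len(depths), "num_vit must not exceed the number of stages"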