ViTConfig

class lucid.models.ViTConfig(image_size: int = 224, patch_size: int = 16, in_channels: int = 3, num_classes: int = 1000, embedding_dim: int = 768, depth: int = 12, num_heads: int = 12, mlp_dim: int = 3072, dropout_rate: float = 0.1)

ViTConfig stores the input, patch embedding, transformer encoder, and output settings used by lucid.models.ViT. It defines the input image size and channel count, patch size, embedding width, encoder depth, number of attention heads, MLP width, number of output classes, and dropout rate.

Class Signature

@dataclass
class ViTConfig:
    image_size: int = 224
    patch_size: int = 16
    in_channels: int = 3
    num_classes: int = 1000
    embedding_dim: int = 768
    depth: int = 12
    num_heads: int = 12
    mlp_dim: int = 3072
    dropout_rate: float = 0.1

Parameters

  • image_size (int): Side length of the input image. The model assumes square inputs, so height and width are both image_size.

  • patch_size (int): Patch size used by the strided patch embedding convolution.

  • in_channels (int): Number of input image channels.

  • num_classes (int): Number of output classes.

  • embedding_dim (int): Transformer token embedding width.

  • depth (int): Number of encoder layers.

  • num_heads (int): Number of attention heads in each encoder layer.

  • mlp_dim (int): Hidden width of the feedforward block inside each encoder layer.

  • dropout_rate (float): Dropout probability applied to token embeddings and encoder layers.
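Several of these parameters interact: image_size and patch_size determine how many patch tokens the encoder sees, and embedding_dim and num_heads determine the width of each attention head. The following is a minimal sketch of that arithmetic. The helper vit_shapes is hypothetical and not part of lucid; it only mirrors the field names above.

```python
# Hypothetical helper (not part of lucid): derives shape quantities
# from the ViTConfig field values using plain integer arithmetic.
def vit_shapes(image_size=224, patch_size=16, in_channels=3,
               embedding_dim=768, num_heads=12):
    patches_per_side = image_size // patch_size       # patches along one image edge
    num_patches = patches_per_side ** 2               # square grid of patches
    values_per_patch = in_channels * patch_size ** 2  # raw pixel values in one patch
    head_dim = embedding_dim // num_heads             # width of each attention head
    return {
        "patches_per_side": patches_per_side,
        "num_patches": num_patches,
        "values_per_patch": values_per_patch,
        "head_dim": head_dim,
    }


shapes = vit_shapes()
print(shapes["num_patches"], shapes["head_dim"])  # 196 64
```

With the defaults, each image yields a 14 x 14 grid of 196 patches, and each of the 12 heads works on a 64-wide slice of the embedding. If the model prepends a class token, as the original Vision Transformer design does, the encoder sequence length would be num_patches + 1; whether lucid.models.ViT does so is an assumption not confirmed here.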

Validation

  • image_size, patch_size, in_channels, num_classes, embedding_dim, depth, num_heads, and mlp_dim must all be greater than 0.

  • image_size must be divisible by patch_size.

  • embedding_dim must be divisible by num_heads.

  • dropout_rate must be in the range [0, 1).
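The rules above can be sketched as a dataclass with a __post_init__ check. This is an illustrative stand-in, not lucid's actual implementation: the class name ViTConfigSketch, the use of ValueError, and the error messages are all assumptions.

```python
from dataclasses import dataclass


# Illustrative stand-in for ViTConfig; lucid's real checks and
# exception type may differ.
@dataclass
class ViTConfigSketch:
    image_size: int = 224
    patch_size: int = 16
    in_channels: int = 3
    num_classes: int = 1000
    embedding_dim: int = 768
    depth: int = 12
    num_heads: int = 12
    mlp_dim: int = 3072
    dropout_rate: float = 0.1

    def __post_init__(self):
        # Every integer field must be strictly positive.
        for name in ("image_size", "patch_size", "in_channels", "num_classes",
                     "embedding_dim", "depth", "num_heads", "mlp_dim"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be greater than 0")
        # The image must tile exactly into patches.
        if self.image_size % self.patch_size != 0:
            raise ValueError("image_size must be divisible by patch_size")
        # The embedding must split evenly across attention heads.
        if self.embedding_dim % self.num_heads != 0:
            raise ValueError("embedding_dim must be divisible by num_heads")
        # Dropout is a probability in the half-open range [0, 1).
        if not 0.0 <= self.dropout_rate < 1.0:
            raise ValueError("dropout_rate must be in the range [0, 1)")


try:
    ViTConfigSketch(image_size=224, patch_size=15)  # 224 is not a multiple of 15
except ValueError as err:
    print(err)
```

Checking at construction time means an invalid combination fails before any model weights are allocated, rather than surfacing later as a shape error inside the encoder.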

Usage

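import lucid.models as models

# Small configuration for 32x32 single-channel images.
config = models.ViTConfig(
    image_size=32,
    patch_size=8,       # 32 / 8 = 4 patches per side, 16 patches in total
    in_channels=1,
    num_classes=10,
    embedding_dim=64,
    depth=2,
    num_heads=4,        # 64 / 4 = 16 dimensions per head
    mlp_dim=128,
    dropout_rate=0.0,   # dropout disabled
)
model = models.ViT(config)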