CrossViTConfig

class lucid.models.CrossViTConfig(img_size: tuple[int, int] | list[int] = (224, 224), patch_size: tuple[int, int] | list[int] = (12, 16), in_channels: int = 3, num_classes: int = 1000, embed_dim: tuple[int, int] | list[int] = (192, 384), depth: tuple[tuple[int, int, int], ...] | list[list[int]] = ((1, 3, 1), (1, 3, 1), (1, 3, 1)), num_heads: tuple[int, int] | list[int] = (6, 12), mlp_ratio: tuple[float, float, float] | list[float] = (2.0, 2.0, 4.0), qkv_bias: bool = False, qk_scale: float | None = None, drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, norm_layer: type[lucid.nn.module.Module] = <class 'lucid.nn.modules.norm.LayerNorm'>, multi_conv: bool = False)

CrossViTConfig stores the dual-branch stage layout and classifier settings used by lucid.models.CrossViT. It defines the branch image sizes, patch sizes, embedding widths, multi-scale block depths, attention heads, and whether the dagger-style multi-convolution patch embedding is enabled.

Class Signature

@dataclass
class CrossViTConfig:
    img_size: tuple[int, int] | list[int] = (224, 224)
    patch_size: tuple[int, int] | list[int] = (12, 16)
    in_channels: int = 3
    num_classes: int = 1000
    embed_dim: tuple[int, int] | list[int] = (192, 384)
    depth: tuple[tuple[int, int, int], ...] | list[list[int]] = ((1, 3, 1), (1, 3, 1), (1, 3, 1))
    num_heads: tuple[int, int] | list[int] = (6, 12)
    mlp_ratio: tuple[float, float, float] | list[float] = (2.0, 2.0, 4.0)
    qkv_bias: bool = False
    qk_scale: float | None = None
    drop_rate: float = 0.0
    attn_drop_rate: float = 0.0
    drop_path_rate: float = 0.0
    norm_layer: type[nn.Module] = nn.LayerNorm
    multi_conv: bool = False

Parameters

  • img_size: Input resolution for the two CrossViT branches.

  • patch_size: Patch size for each branch.

  • in_channels (int): Number of input image channels.

  • num_classes (int): Number of output classes. Set to 0 to replace the classification heads with identity layers.

  • embed_dim: Embedding width for each branch.

  • depth: Sequence of multi-scale block specs, one per stage. Each entry holds three integers: the depths of the two branches followed by the fusion depth.

  • num_heads: Attention head counts for the two branches.

  • mlp_ratio: MLP expansion ratios for the two branches and the cross-fusion path.

  • qkv_bias (bool): Whether query, key, and value projections use bias.

  • qk_scale (float | None): Optional override for the attention scaling factor. None keeps the default scaling.

  • drop_rate, attn_drop_rate, drop_path_rate (float): Dropout rate, attention dropout rate, and stochastic depth rate, respectively.

  • norm_layer (type[nn.Module]): Normalization layer used throughout the model.

  • multi_conv (bool): Whether to use the dagger-style multi-convolution patch embedding.
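To make the per-branch and per-stage parameters concrete, the sketch below derives a few quantities from the default values in the signature. It does not import lucid. The MLP hidden width formula (embedding width times mlp_ratio) follows the usual ViT convention and is an assumption here, not something this page confirms. All variable names are illustrative.

```python
# Default values copied from the CrossViTConfig signature.
embed_dim = (192, 384)
num_heads = (6, 12)
depth = ((1, 3, 1), (1, 3, 1), (1, 3, 1))
mlp_ratio = (2.0, 2.0, 4.0)

# Per-branch head width: the embedding width split evenly across heads.
head_dims = tuple(d // h for d, h in zip(embed_dim, num_heads))

# Per-branch MLP hidden width, ASSUMING hidden = embed_dim * mlp_ratio
# (common ViT convention; not confirmed for lucid's CrossViT).
mlp_hidden = tuple(int(d * r) for d, r in zip(embed_dim, mlp_ratio[:2]))

# Each stage spec reads as (first-branch depth, second-branch depth, fusion depth).
total_branch_blocks = tuple(sum(spec[i] for spec in depth) for i in range(2))
total_fusion_blocks = sum(spec[2] for spec in depth)

print(head_dims)            # (32, 32)
print(mlp_hidden)           # (384, 768)
print(total_branch_blocks)  # (3, 9)
print(total_fusion_blocks)  # 3
```

With the defaults, both branches end up with the same head width even though their embedding widths differ, because the head counts scale with the widths.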

Validation

  • img_size, patch_size, embed_dim, and num_heads must each contain exactly two positive integers.

  • in_channels must be greater than 0.

  • num_classes must be greater than or equal to 0.

  • mlp_ratio must contain exactly three positive values.

  • depth must contain at least one stage spec, and each spec must contain two positive branch depths plus one non-negative fusion depth.

  • Each embedding width must be divisible by the corresponding head count.

  • drop_rate, attn_drop_rate, and drop_path_rate must each be in [0, 1).

  • If multi_conv=True, every patch size must be either 12 or 16.
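The rules above can be sketched as a standalone check. This is a minimal illustration, not lucid's implementation: the function name check_crossvit_settings is hypothetical, and raising ValueError is an assumption, since this page does not state which exception CrossViTConfig raises.

```python
def check_crossvit_settings(
    img_size=(224, 224),
    patch_size=(12, 16),
    in_channels=3,
    num_classes=1000,
    embed_dim=(192, 384),
    depth=((1, 3, 1), (1, 3, 1), (1, 3, 1)),
    num_heads=(6, 12),
    mlp_ratio=(2.0, 2.0, 4.0),
    drop_rate=0.0,
    attn_drop_rate=0.0,
    drop_path_rate=0.0,
    multi_conv=False,
):
    """Hypothetical helper mirroring the documented rules (not lucid code)."""

    def is_pos_int(v):
        # bool is a subclass of int, so exclude it explicitly.
        return isinstance(v, int) and not isinstance(v, bool) and v > 0

    pairs = {"img_size": img_size, "patch_size": patch_size,
             "embed_dim": embed_dim, "num_heads": num_heads}
    for name, pair in pairs.items():
        if len(pair) != 2 or not all(is_pos_int(v) for v in pair):
            raise ValueError(f"{name} must contain exactly two positive integers")

    if in_channels <= 0:
        raise ValueError("in_channels must be greater than 0")
    if num_classes < 0:
        raise ValueError("num_classes must be greater than or equal to 0")
    if len(mlp_ratio) != 3 or not all(r > 0 for r in mlp_ratio):
        raise ValueError("mlp_ratio must contain exactly three positive values")

    if len(depth) == 0:
        raise ValueError("depth must contain at least one stage spec")
    for spec in depth:
        # Two positive branch depths plus one non-negative fusion depth.
        if len(spec) != 3 or spec[0] <= 0 or spec[1] <= 0 or spec[2] < 0:
            raise ValueError(f"invalid stage spec: {spec!r}")

    for width, heads in zip(embed_dim, num_heads):
        if width % heads != 0:
            raise ValueError("each embed_dim must be divisible by its num_heads")

    rates = {"drop_rate": drop_rate, "attn_drop_rate": attn_drop_rate,
             "drop_path_rate": drop_path_rate}
    for name, rate in rates.items():
        if not 0.0 <= rate < 1.0:
            raise ValueError(f"{name} must be in [0, 1)")

    if multi_conv and any(p not in (12, 16) for p in patch_size):
        raise ValueError("multi_conv=True requires every patch size to be 12 or 16")


# The defaults pass; an embedding width not divisible by its head count does not.
check_crossvit_settings()
try:
    check_crossvit_settings(embed_dim=(30, 64), num_heads=(4, 4))
except ValueError as err:
    print(err)
```

Under these rules, a small patch size such as 8 is rejected only when multi_conv=True, because the multi-convolution embedding is restricted to patch sizes 12 and 16.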

Usage

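import lucid.models as models

# A small configuration for 32x32 single-channel inputs with 10 classes.
config = models.CrossViTConfig(
    img_size=(32, 32),
    patch_size=(8, 16),
    in_channels=1,
    num_classes=10,
    embed_dim=(32, 64),  # each width is divisible by its head count below
    depth=((1, 1, 0), (1, 1, 0)),  # two stages, each with fusion depth 0
    num_heads=(4, 4),
    mlp_ratio=(2.0, 2.0, 1.0),
    drop_path_rate=0.0,
)
model = models.CrossViT(config)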