CrossViTConfig¶
- class lucid.models.CrossViTConfig(img_size: tuple[int, int] | list[int] = (224, 224), patch_size: tuple[int, int] | list[int] = (12, 16), in_channels: int = 3, num_classes: int = 1000, embed_dim: tuple[int, int] | list[int] = (192, 384), depth: tuple[tuple[int, int, int], ...] | list[list[int]] = ((1, 3, 1), (1, 3, 1), (1, 3, 1)), num_heads: tuple[int, int] | list[int] = (6, 12), mlp_ratio: tuple[float, float, float] | list[float] = (2.0, 2.0, 4.0), qkv_bias: bool = False, qk_scale: float | None = None, drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.0, norm_layer: type[lucid.nn.module.Module] = <class 'lucid.nn.modules.norm.LayerNorm'>, multi_conv: bool = False)¶
CrossViTConfig stores the dual-branch stage layout and classifier settings used
by lucid.models.CrossViT. It defines the branch image sizes, patch
sizes, embedding widths, multi-scale block depths, attention heads, and whether
the dagger-style multi-convolution patch embedding is enabled.
Class Signature¶
@dataclass
class CrossViTConfig:
    img_size: tuple[int, int] | list[int] = (224, 224)
    patch_size: tuple[int, int] | list[int] = (12, 16)
    in_channels: int = 3
    num_classes: int = 1000
    embed_dim: tuple[int, int] | list[int] = (192, 384)
    depth: tuple[tuple[int, int, int], ...] | list[list[int]] = ((1, 3, 1), (1, 3, 1), (1, 3, 1))
    num_heads: tuple[int, int] | list[int] = (6, 12)
    mlp_ratio: tuple[float, float, float] | list[float] = (2.0, 2.0, 4.0)
    qkv_bias: bool = False
    qk_scale: float | None = None
    drop_rate: float = 0.0
    attn_drop_rate: float = 0.0
    drop_path_rate: float = 0.0
    norm_layer: type[nn.Module] = nn.LayerNorm
    multi_conv: bool = False
Parameters¶
img_size (tuple[int, int] | list[int]): Input resolution for the two CrossViT branches.
patch_size (tuple[int, int] | list[int]): Patch size for each branch.
in_channels (int): Number of input image channels.
num_classes (int): Number of output classes. Set to 0 to keep identity heads.
embed_dim (tuple[int, int] | list[int]): Embedding width for each branch.
depth (tuple[tuple[int, int, int], ...] | list[list[int]]): Sequence of multi-scale stage specs; each entry holds two branch depths followed by one fusion depth.
num_heads (tuple[int, int] | list[int]): Attention head count for each branch.
mlp_ratio (tuple[float, float, float] | list[float]): MLP expansion ratios for the two branches and the cross-fusion path.
qkv_bias (bool): Whether the query, key, and value projections use bias.
qk_scale (float | None): Optional attention scaling override.
drop_rate (float), attn_drop_rate (float), drop_path_rate (float): Dropout, attention dropout, and stochastic depth rates.
norm_layer (type[nn.Module]): Normalization layer used throughout the model.
multi_conv (bool): Whether to use the dagger-style multi-convolution patch embedding.
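To make the depth layout concrete, the sketch below interprets the default spec of three identical stages, each holding (small-branch depth, large-branch depth, fusion depth). The per-branch totals shown are derived from this reading of the spec, not from lucid's internals.

```python
# Illustrative only: sum the default depth spec per position to see
# how many transformer blocks each branch and the fusion path get.
depth = ((1, 3, 1), (1, 3, 1), (1, 3, 1))

small_blocks = sum(stage[0] for stage in depth)   # branch-0 blocks
large_blocks = sum(stage[1] for stage in depth)   # branch-1 blocks
fusion_blocks = sum(stage[2] for stage in depth)  # cross-fusion blocks

print(small_blocks, large_blocks, fusion_blocks)  # 3 9 3
```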
Validation¶
img_size, patch_size, embed_dim, and num_heads must each contain exactly two positive integers.
in_channels must be greater than 0.
num_classes must be greater than or equal to 0.
mlp_ratio must contain exactly three positive values.
depth must contain at least one stage spec, and each spec must contain two positive branch depths plus one non-negative fusion depth.
Each embedding width must be divisible by the corresponding head count.
Dropout rates must each be in [0, 1).
If multi_conv=True, every patch size must be either 12 or 16.
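Two of the rules above can be sketched in plain Python. This is a hypothetical re-implementation for illustration, not lucid's actual validation code; the function name `check` is an assumption.

```python
# Hypothetical sketch (not lucid's code) of two validation rules:
# per-branch head-count divisibility, and dropout rates in [0, 1).
def check(embed_dim, num_heads, drop_rates):
    for dim, heads in zip(embed_dim, num_heads):
        if dim % heads != 0:
            raise ValueError(f"embed_dim {dim} not divisible by num_heads {heads}")
    for rate in drop_rates:
        if not 0.0 <= rate < 1.0:
            raise ValueError(f"dropout rate {rate} outside [0, 1)")

check((192, 384), (6, 12), (0.0, 0.0, 0.0))  # defaults pass silently
```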
Usage¶
import lucid.models as models
config = models.CrossViTConfig(
    img_size=(32, 32),
    patch_size=(8, 16),
    in_channels=1,
    num_classes=10,
    embed_dim=(32, 64),
    depth=((1, 1, 0), (1, 1, 0)),
    num_heads=(4, 4),
    mlp_ratio=(2.0, 2.0, 1.0),
    drop_path_rate=0.0,
)
model = models.CrossViT(config)
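As a rough sanity check on a config like the one above, you can estimate the sequence length each branch will process, assuming standard non-overlapping ViT patch embedding with one CLS token per branch (an assumption about lucid's tokenization, not a documented guarantee):

```python
# Estimated token counts per branch for a 32x32 input,
# assuming non-overlapping patches plus a single CLS token.
img_size = (32, 32)
for patch in (8, 16):
    tokens = (img_size[0] // patch) * (img_size[1] // patch) + 1
    print(f"patch {patch}: {tokens} tokens")  # 17 and 5 tokens
```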