SwinTransformerV2Config

class lucid.models.SwinTransformerV2Config(img_size: int = 224, patch_size: int = 4, in_channels: int = 3, num_classes: int = 1000, embed_dim: int = 96, depths: tuple[int, ...] | list[int] = (2, 2, 6, 2), num_heads: tuple[int, ...] | list[int] = (3, 6, 12, 24), window_size: int = 7, mlp_ratio: float = 4.0, qkv_bias: bool = True, qk_scale: float | None = None, drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.1, norm_layer: Type[nn.Module] = nn.LayerNorm, abs_pos_emb: bool = False, patch_norm: bool = True)

SwinTransformerV2Config stores the hierarchical stage layout and classifier settings used by lucid.models.SwinTransformer_V2. It exposes the same configuration surface as SwinTransformerConfig, adapted to the V2 attention and post-normalization architecture.

Class Signature

@dataclass
class SwinTransformerV2Config:
    img_size: int = 224
    patch_size: int = 4
    in_channels: int = 3
    num_classes: int = 1000
    embed_dim: int = 96
    depths: tuple[int, ...] | list[int] = (2, 2, 6, 2)
    num_heads: tuple[int, ...] | list[int] = (3, 6, 12, 24)
    window_size: int = 7
    mlp_ratio: float = 4.0
    qkv_bias: bool = True
    qk_scale: float | None = None
    drop_rate: float = 0.0
    attn_drop_rate: float = 0.0
    drop_path_rate: float = 0.1
    norm_layer: Type[nn.Module] = nn.LayerNorm
    abs_pos_emb: bool = False
    patch_norm: bool = True

Parameters

  • img_size (int): Input image size. Swin Transformer assumes square inputs.

  • patch_size (int): Patch size used by the patch embedding convolution.

  • in_channels (int): Number of input image channels.

  • num_classes (int): Number of output classes. Set to 0 to replace the classifier with an identity head and return the pooled features directly.

  • embed_dim (int): Base embedding width of the first stage.

  • depths (tuple[int, ...] | list[int]): Number of transformer blocks in each hierarchical stage.

  • num_heads (tuple[int, ...] | list[int]): Number of attention heads used by each stage.

  • window_size (int): Local self-attention window size.

  • mlp_ratio (float): Ratio of the feedforward hidden width to the embedding width inside each block.

  • qkv_bias (bool): Whether to learn query, key, and value projection biases.

  • qk_scale (float | None): Optional override of the attention scaling factor; None uses the default scaling.

  • drop_rate (float): Dropout applied to patch tokens and block projections.

  • attn_drop_rate (float): Dropout applied to attention probabilities.

  • drop_path_rate (float): Stochastic depth rate across the stage stack.

  • norm_layer (Type[nn.Module]): Normalization layer used throughout the model.

  • abs_pos_emb (bool): Whether to add absolute positional embeddings.

  • patch_norm (bool): Whether to normalize tokens immediately after patch embedding.
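The per-stage geometry implied by embed_dim, depths, and num_heads can be worked out with a short sketch. This assumes, as in the original Swin design, that the embedding width doubles at each downsampling stage; the per-head width is then the stage width divided by that stage's head count:

```python
# Sketch of the per-stage geometry for the default configuration.
# Assumption: stage i has width embed_dim * 2**i (width doubles per stage).
embed_dim = 96
depths = (2, 2, 6, 2)
num_heads = (3, 6, 12, 24)

stage_widths = [embed_dim * 2**i for i in range(len(depths))]
head_dims = [w // h for w, h in zip(stage_widths, num_heads)]

print(stage_widths)  # [96, 192, 384, 768]
print(head_dims)     # [32, 32, 32, 32]
```

With the defaults, every stage keeps a constant 32-dimensional per-head width, which is why the default head counts (3, 6, 12, 24) track the doubling stage widths.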

Validation

  • img_size, patch_size, in_channels, embed_dim, and window_size must be greater than 0.

  • img_size must be greater than or equal to patch_size.

  • num_classes must be greater than or equal to 0.

  • depths must be non-empty and contain only positive integers.

  • num_heads must contain one positive integer per stage in depths.

  • Each stage embedding width must be divisible by its configured head count.

  • The patch resolution must be large enough to support the configured number of downsampling stages.

  • mlp_ratio must be greater than 0.

  • drop_rate, attn_drop_rate, and drop_path_rate must each be in [0, 1).
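The rules above can be collected into a standalone checker. This is a hedged sketch in plain Python, not lucid's actual validation code: the helper name is hypothetical, the stage-width doubling (embed_dim * 2**i) is assumed from the standard Swin layout, and the real dataclass may raise different error types:

```python
# Hypothetical standalone checker mirroring the validation rules listed above.
def check_config(img_size=224, patch_size=4, in_channels=3, num_classes=1000,
                 embed_dim=96, depths=(2, 2, 6, 2), num_heads=(3, 6, 12, 24),
                 window_size=7, mlp_ratio=4.0,
                 drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.1):
    assert min(img_size, patch_size, in_channels, embed_dim, window_size) > 0
    assert img_size >= patch_size
    assert num_classes >= 0
    assert len(depths) > 0 and all(d > 0 for d in depths)
    assert len(num_heads) == len(depths) and all(h > 0 for h in num_heads)
    # Assumed stage widths: embed_dim * 2**i must divide evenly by head count.
    assert all((embed_dim * 2**i) % h == 0 for i, h in enumerate(num_heads))
    # Each stage after the first halves the patch grid; it must stay >= 1.
    patches = img_size // patch_size
    assert patches >> (len(depths) - 1) >= 1
    assert mlp_ratio > 0
    assert all(0 <= r < 1 for r in (drop_rate, attn_drop_rate, drop_path_rate))
    return True
```

For example, the defaults pass, while a head count that does not divide its stage width (such as num_heads=(5, 6, 12, 24) against embed_dim=96) fails the divisibility check.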

Usage

import lucid.models as models

config = models.SwinTransformerV2Config(
    img_size=32,
    patch_size=4,
    in_channels=1,
    num_classes=10,
    embed_dim=8,
    depths=(2, 2),
    num_heads=(2, 4),
    window_size=7,
    drop_rate=0.0,
    attn_drop_rate=0.0,
    drop_path_rate=0.0,
)
model = models.SwinTransformer_V2(config)