SwinTransformer

Transformer, Vision Transformer, Image Classification

class lucid.models.SwinTransformer(img_size: int = 224, patch_size: int = 4, in_channels: int = 3, num_classes: int = 1000, embed_dim: int = 96, depths: list[int] = [2, 2, 6, 2], num_heads: list[int] = [3, 6, 12, 24], windows_size: int = 7, mlp_ratio: float = 4.0, qkv_bias: bool = True, qk_scale: float | None = None, drop_rate: float = 0.0, attn_drop_rate: float = 0.0, drop_path_rate: float = 0.1, norm_layer: Type[nn.Module] = nn.LayerNorm, abs_pos_emb: bool = False, patch_norm: bool = True)

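The SwinTransformer class implements the Swin Transformer, a hierarchical vision transformer built on shifted-window self-attention and designed for image recognition and dense prediction tasks. The original Vision Transformer (ViT) computes global self-attention over all patches at a single resolution, so its attention cost grows quadratically with the number of patches. The Swin Transformer instead splits the image into patches, computes self-attention within non-overlapping local windows, and shifts the window partition between consecutive blocks so that information can cross window boundaries. Because each window holds a fixed number of patches, attention cost grows linearly with image size, and stacking stages at progressively lower resolution lets the model capture both local and global dependencies.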
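
The effect of shifting the window partition can be illustrated with a few lines of plain Python. This is a minimal sketch and not the library's implementation: the helper `window_index` is hypothetical, the shift of half a window follows the Swin paper, and the attention mask that real implementations apply to tokens wrapped around by the cyclic shift is left out.

```python
def window_index(row, col, height, width, window, shift=0):
    """Return the index of the window that token (row, col) falls into.

    With shift > 0 the token grid is first rolled cyclically up and to the
    left by `shift` positions, then cut into the same regular window grid.
    """
    if height % window or width % window:
        raise ValueError("grid size must be a multiple of the window size")
    rolled_row = (row - shift) % height
    rolled_col = (col - shift) % width
    windows_per_row = width // window
    return (rolled_row // window) * windows_per_row + rolled_col // window


# Defaults from the signature: 224 / 4 = 56 tokens per side, 7 x 7 windows.
side, window = 224 // 4, 7
shift = window // 2  # assumption: shift by half a window, as in the Swin paper

# Two diagonal neighbours that sit on opposite sides of a window border.
a, b = (6, 6), (7, 7)

regular = [window_index(r, c, side, side, window) for r, c in (a, b)]
shifted = [window_index(r, c, side, side, window, shift) for r, c in (a, b)]

print(regular)  # the two tokens fall into different windows
print(shifted)  # after the shift they share one window and can attend to each other
```

Alternating the regular and shifted partitions from block to block is what lets information travel between windows without ever computing attention over the full token grid.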

Class Signature

class SwinTransformer(
    img_size: int = 224,
    patch_size: int = 4,
    in_channels: int = 3,
    num_classes: int = 1000,
    embed_dim: int = 96,
    depths: list[int] = [2, 2, 6, 2],
    num_heads: list[int] = [3, 6, 12, 24],
    windows_size: int = 7,
    mlp_ratio: float = 4.0,
    qkv_bias: bool = True,
    qk_scale: float | None = None,
    drop_rate: float = 0.0,
    attn_drop_rate: float = 0.0,
    drop_path_rate: float = 0.1,
    norm_layer: Type[nn.Module] = nn.LayerNorm,
    abs_pos_emb: bool = False,
    patch_norm: bool = True,
)

Parameters

img_size (int): Size of the input image (assumes square images).
patch_size (int): Size of the patches the image is divided into.
in_channels (int): Number of input channels (e.g., 3 for RGB images).
num_classes (int): Number of output classes for classification.
embed_dim (int): Dimension of the embedding for the first stage.
depths (list[int]): A list specifying the number of transformer blocks in each stage.
num_heads (list[int]): A list specifying the number of attention heads in each stage.
windows_size (int): Side length, in patches, of the square local window used for self-attention.
mlp_ratio (float): Ratio of the MLP hidden dimension to the embedding dimension.
qkv_bias (bool): Whether to include a learnable bias in the query, key, and value projections.
qk_scale (float | None): If provided, overrides the default scale factor applied to the query-key dot products.
drop_rate (float): Dropout probability applied throughout the model.
attn_drop_rate (float): Dropout probability for the attention weights.
drop_path_rate (float): Stochastic depth rate for regularization.
norm_layer (Type[nn.Module]): Normalization layer to be used (default is nn.LayerNorm).
abs_pos_emb (bool): Whether to use absolute positional embedding.
patch_norm (bool): Whether to apply normalization after patch embedding.
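
To see how these parameters interact, the sketch below derives the per-stage shapes they imply. It assumes the standard Swin layout, in which each stage after the first halves the token resolution and doubles the channel width, and a stochastic depth rate that grows linearly from 0 to drop_path_rate across all blocks, as in the reference Swin implementation. Whether lucid follows both conventions exactly is an assumption here, and the helpers `stage_layout` and `drop_path_schedule` are hypothetical names, not part of the library.

```python
def stage_layout(img_size=224, patch_size=4, embed_dim=96,
                 depths=(2, 2, 6, 2), num_heads=(3, 6, 12, 24),
                 windows_size=7, mlp_ratio=4.0):
    """Per-stage shapes implied by the arguments (assumed standard Swin layout)."""
    if img_size % patch_size:
        raise ValueError("img_size must be divisible by patch_size")
    if len(depths) != len(num_heads):
        raise ValueError("depths and num_heads need one entry per stage")

    tokens_per_side = img_size // patch_size
    stages = []
    for i, (depth, heads) in enumerate(zip(depths, num_heads)):
        resolution = tokens_per_side // 2 ** i  # assumed: halved between stages
        dim = embed_dim * 2 ** i                # assumed: doubled between stages
        if dim % heads:
            raise ValueError(f"stage {i}: dim {dim} not divisible by {heads} heads")
        stages.append({
            "stage": i,
            "depth": depth,
            "resolution": resolution,
            "dim": dim,
            "head_dim": dim // heads,
            "mlp_hidden": int(dim * mlp_ratio),
            "windows_per_side": resolution // windows_size,
        })
    return stages


def drop_path_schedule(depths=(2, 2, 6, 2), drop_path_rate=0.1):
    """Linearly increasing per-block drop-path rates (assumed schedule)."""
    total = sum(depths)
    if total < 2:
        return [0.0] * total
    return [drop_path_rate * i / (total - 1) for i in range(total)]


layout = stage_layout()
for stage in layout:
    print(stage)

rates = drop_path_schedule()
print([round(r, 3) for r in rates])
```

Under these assumptions the default arguments give token grids of 56, 28, 14, and 7 per side with widths of 96, 192, 384, and 768, so every head works on 32 channels and the last stage consists of a single 7 x 7 window.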

Examples

>>> import lucid.models as models
>>> swin = models.SwinTransformer(
...     img_size=224,
...     patch_size=4,
...     in_channels=3,
...     num_classes=1000,
...     embed_dim=96,
...     depths=[2, 2, 6, 2],
...     num_heads=[3, 6, 12, 24],
... )