cvt_w24

lucid.models.cvt_w24(num_classes: int = 1000, **kwargs) -> CvT

The cvt_w24 function instantiates a CvT-W24 model, a variant of the Convolutional Vision Transformer (CvT). This architecture integrates depthwise convolutions into the self-attention mechanism, enhancing local feature extraction while preserving global contextual understanding. CvT-W24 is the largest CvT variant, with wider embeddings and more transformer blocks than the smaller variants, targeting high accuracy on image classification tasks.

Total Parameters: 277,196,392

Function Signature

@register_model
def cvt_w24(num_classes: int = 1000, **kwargs) -> CvT

Parameters

  • num_classes (int, optional): The number of output classes for classification. Default is 1000.

  • kwargs (dict, optional): Additional parameters for customization, including:

    • embed_dim (list[int]): A list specifying the embedding dimensions for different stages. The default for CvT-W24 is [192, 768, 1024], indicating the number of channels at each stage of the model.

    • depths (list[int]): A list specifying the number of transformer blocks in each stage. The default configuration for CvT-W24 is [2, 2, 20], representing the depth of the transformer at each stage.

    • num_heads (list[int]): A list specifying the number of attention heads in each stage. The default values for CvT-W24 are [3, 12, 16], enabling multi-head self-attention at different scales.
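The per-stage defaults above are mutually consistent: each stage's embedding dimension splits evenly across its attention heads, yielding a per-head dimension of 64 at every stage, and the total block count matches the "24" in the model name. A standalone sanity check (plain Python, not part of the lucid API):

```python
# Default per-stage configuration for CvT-W24 (values from this page).
embed_dim = [192, 768, 1024]   # channels at each stage
depths = [2, 2, 20]            # transformer blocks per stage
num_heads = [3, 12, 16]        # attention heads per stage

# Each stage's embedding dimension must be divisible by its head count.
assert all(d % h == 0 for d, h in zip(embed_dim, num_heads))

# Per-head dimension is constant across stages.
head_dims = [d // h for d, h in zip(embed_dim, num_heads)]
print(head_dims)       # → [64, 64, 64]

# The "24" in CvT-W24 corresponds to the total number of transformer blocks.
print(sum(depths))     # → 24
```

Custom values passed via kwargs should satisfy the same divisibility constraint, since multi-head attention splits the embedding dimension across heads.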

Returns

  • CvT: An instance of the CvT class configured as a CvT-W24 vision transformer.

Examples

>>> import lucid.models as models
>>> model = models.cvt_w24()
>>> print(model)
CvT(num_classes=1000, embed_dim=[192, 768, 1024], depths=[2, 2, 20], num_heads=[3, 12, 16], ...)