Image Classification¶
LeNet¶
ConvNet Image Classification
LeNet is a pioneering CNN by Yann LeCun for digit recognition, combining convolutional, pooling, and fully connected layers. It introduced concepts like weight sharing and local receptive fields, shaping modern CNNs.
LeCun, Yann, et al. “Gradient-Based Learning Applied to Document Recognition.” Proceedings of the IEEE, vol. 86, no. 11, Nov. 1998, pp. 2278-2324. doi:10.1109/5.726791.
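The layer layout is easy to make concrete. Below is a minimal PyTorch sketch in the spirit of LeNet-5 (the `LeNet` class is illustrative, not the exact 1998 network; pooling and activation details are simplified):

```python
import torch
import torch.nn as nn

class LeNet(nn.Module):
    """Two conv+pool stages followed by fully connected layers, for 32x32 grayscale digits."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 32x32 -> 28x28
            nn.Tanh(),
            nn.AvgPool2d(2),                  # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),  # 14x14 -> 10x10
            nn.Tanh(),
            nn.AvgPool2d(2),                  # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.Tanh(),
            nn.Linear(120, 84),
            nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

logits = LeNet()(torch.randn(1, 1, 32, 32))
print(logits.shape)  # torch.Size([1, 10])
```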
AlexNet¶
ConvNet Image Classification
AlexNet is a pioneering convolutional neural network introduced in 2012, known for its deep architecture and use of ReLU activations, dropout, and GPU acceleration. It achieved groundbreaking performance in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012, popularizing deep learning for computer vision.
Krizhevsky, Alex, et al. “ImageNet Classification with Deep Convolutional Neural Networks.” Advances in Neural Information Processing Systems, vol. 25, 2012, pp. 1097-1105.
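The parameter counts in the tables below can be reproduced by instantiating a reference implementation and summing parameter sizes. As an example, assuming the `torchvision` package is available, its AlexNet variant matches the figure in the table:

```python
import torch
from torchvision.models import alexnet

model = alexnet()  # randomly initialized weights are enough for counting
n_params = sum(p.numel() for p in model.parameters())
print(f"AlexNet parameters: {n_params:,}")  # 61,100,840 for the torchvision variant

# A forward pass with the table's input shape (N=1) confirms the expected output shape.
out = model(torch.randn(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 1000])
```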
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| AlexNet | \((N,3,224,224)\) | 61,100,840 | 715.21M |
ZFNet¶
ConvNet Image Classification
ZFNet (Zeiler and Fergus Net) is a convolutional neural network that improved upon AlexNet by using smaller convolutional filters and visualizing learned features to better understand network behavior. It achieved state-of-the-art results in object recognition and provided insights into deep learning interpretability.
Zeiler, Matthew D., and Rob Fergus. “Visualizing and Understanding Convolutional Networks.” European Conference on Computer Vision (ECCV), Springer, Cham, 2014, pp. 818-833. doi:10.1007/978-3-319-10590-1_53.
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| ZFNet | \((N,3,224,224)\) | 62,357,608 | 1.20B |
VGGNet¶
ConvNet Image Classification
VGGNet is a deep convolutional neural network known for its simplicity and use of small 3x3 convolutional filters, which significantly improved object recognition accuracy.
Simonyan, Karen, and Andrew Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition.” arXiv preprint arXiv:1409.1556, 2014.
Inception¶
ConvNet Image Classification
The Inception architecture, introduced in the GoogLeNet model, is a deep convolutional neural network designed for efficient feature extraction using parallel convolutional and pooling branches, reducing computational cost. It achieves this by combining multi-scale feature processing within each module, making it highly effective for image classification tasks.
Szegedy, Christian, et al. “Going Deeper with Convolutions.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1-9.
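The defining building block is a module whose parallel branches (1x1, 3x3, 5x5 convolutions and pooling) are concatenated along the channel axis. A minimal sketch of such a module follows; the `InceptionModule` class is illustrative, with branch widths matching the paper's first module:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 / pooling branches, concatenated on the channel axis."""
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, c1, 1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c3_red, c3, 3, padding=1),
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c5_red, c5, 5, padding=2),
        )
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1),
        )

    def forward(self, x):
        # All branches preserve spatial size, so their outputs can be concatenated.
        return torch.cat(
            [self.branch1(x), self.branch3(x), self.branch5(x), self.branch_pool(x)],
            dim=1,
        )

y = InceptionModule(192, 64, 96, 128, 16, 32, 32)(torch.randn(1, 192, 28, 28))
print(y.shape)  # torch.Size([1, 256, 28, 28])
```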
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| Inception-v1 (GoogLeNet) | \((N,3,224,224)\) | 13,393,352 | 1.62B |
| Inception-v3 | \((N,3,299,299)\) | 30,817,392 | 3.20B |
| Inception-v4 | \((N,3,299,299)\) | 40,586,984 | 5.75B |
Inception-ResNet¶
ConvNet Image Classification
The Inception-ResNet architecture builds upon the Inception model by integrating residual connections, which improve gradient flow and training stability in very deep networks. This combination of Inception’s multi-scale feature processing with ResNet’s efficient backpropagation allows for a powerful and scalable design, suitable for a wide range of image classification tasks.
Szegedy, Christian, et al. “Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning.” Proceedings of the AAAI Conference on Artificial Intelligence, 2017, pp. 4278-4284.
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| Inception-ResNet-v1 | \((N,3,299,299)\) | 22,739,128 | 3.16B |
| Inception-ResNet-v2 | \((N,3,299,299)\) | 35,847,512 | 4.54B |
ResNet¶
ConvNet Image Classification
ResNets (Residual Networks) are deep neural network architectures that use skip connections (residual connections) to alleviate the vanishing gradient problem, enabling the training of extremely deep models. They revolutionized deep learning by introducing identity mappings, allowing efficient backpropagation and improved accuracy in tasks like image classification and object detection.
He, Kaiming, et al. “Deep Residual Learning for Image Recognition.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778.
He, Kaiming, et al. “Identity Mappings in Deep Residual Networks.” European Conference on Computer Vision (ECCV), Springer, 2016, pp. 630-645.
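The central idea is the residual block: a short stack of convolutions whose output is added back to the block's input through an identity (or projection) shortcut. A minimal sketch of the two-layer basic block used in ResNet-18/34 (the `BasicBlock` class here is illustrative, not a library implementation):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # Projection shortcut when the shape changes; identity otherwise.
        self.shortcut = (
            nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                          nn.BatchNorm2d(out_ch))
            if stride != 1 or in_ch != out_ch else nn.Identity()
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))  # the residual connection

print(BasicBlock(64, 128, stride=2)(torch.randn(1, 64, 56, 56)).shape)  # (1, 128, 28, 28)
```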
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| ResNet-18 | \((N,3,224,224)\) | 11,689,512 | 1.84B |
| ResNet-34 | \((N,3,224,224)\) | 21,797,672 | 3.70B |
| ResNet-50 | \((N,3,224,224)\) | 25,557,032 | 4.20B |
| ResNet-101 | \((N,3,224,224)\) | 44,549,160 | 7.97B |
| ResNet-152 | \((N,3,224,224)\) | 60,192,808 | 11.75B |
| ResNet-200 | \((N,3,224,224)\) | 64,669,864 | 15.35B |
| ResNet-269 | \((N,3,224,224)\) | 102,069,416 | 20.46B |
| ResNet-1001 | \((N,3,224,224)\) | 149,071,016 | 43.94B |
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| WideResNet-50 | \((N,3,224,224)\) | 78,973,224 | 11.55B |
| WideResNet-101 | \((N,3,224,224)\) | 126,886,696 | 22.97B |
ResNeXt¶
ConvNet Image Classification
ResNeXt is an extension of the ResNet architecture that introduces a cardinality dimension to the model, improving its performance and efficiency by allowing flexible aggregation of transformations. ResNeXt builds on residual blocks by incorporating grouped convolutions, enabling parallel pathways for feature learning.
Xie, Saining, et al. “Aggregated Residual Transformations for Deep Neural Networks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5987-5995.
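Cardinality is realized in practice with grouped convolutions: the 3x3 convolution inside a bottleneck block is split into C parallel groups. A minimal sketch with cardinality 32 and bottleneck width 4, i.e. the “32x4d” setting (the `ResNeXtBottleneck` class and its channel bookkeeping are illustrative):

```python
import torch
import torch.nn as nn

class ResNeXtBottleneck(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, cardinality: int = 32, base_width: int = 4):
        super().__init__()
        width = cardinality * base_width  # e.g. 32 * 4 = 128 internal channels
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, width, 1, bias=False), nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            # groups=cardinality splits this conv into 32 parallel transformations
            nn.Conv2d(width, width, 3, padding=1, groups=cardinality, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )
        self.shortcut = nn.Identity() if in_ch == out_ch else nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.block(x) + self.shortcut(x))

print(ResNeXtBottleneck(256, 256)(torch.randn(1, 256, 56, 56)).shape)  # (1, 256, 56, 56)
```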
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| ResNeXt-50-32x4d | \((N,3,224,224)\) | 25,028,904 | 4.38B |
| ResNeXt-101-32x4d | \((N,3,224,224)\) | 44,177,704 | 8.19B |
| ResNeXt-101-32x8d | \((N,3,224,224)\) | 88,791,336 | 16.73B |
| ResNeXt-101-32x16d | \((N,3,224,224)\) | 194,026,792 | 36.68B |
| ResNeXt-101-32x32d | \((N,3,224,224)\) | 468,530,472 | 88.03B |
| ResNeXt-101-64x4d | \((N,3,224,224)\) | 83,455,272 | 15.78B |
SENet¶
ConvNet Image Classification
SENets (Squeeze-and-Excitation Networks) are deep neural network architectures that enhance the representational power of models by explicitly modeling channel interdependencies. They introduce a novel “squeeze-and-excitation” block, which adaptively recalibrates channel-wise feature responses.
Hu, Jie, et al. “Squeeze-and-Excitation Networks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7132-7141.
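The squeeze-and-excitation block itself is small: global average pooling “squeezes” each channel to a scalar, a two-layer bottleneck MLP produces per-channel weights, and the input feature map is rescaled channel-wise. A minimal sketch (the `SEBlock` class is illustrative):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))              # squeeze: global average pool -> (N, C)
        w = self.fc(s).view(n, c, 1, 1)     # excitation: per-channel weights in (0, 1)
        return x * w                        # recalibrate the feature map

print(SEBlock(64)(torch.randn(2, 64, 56, 56)).shape)  # torch.Size([2, 64, 56, 56])
```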
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| SE-ResNet-18 | \((N,3,224,224)\) | 11,778,592 | 1.84B |
| SE-ResNet-34 | \((N,3,224,224)\) | 21,958,868 | 3.71B |
| SE-ResNet-50 | \((N,3,224,224)\) | 28,088,024 | 4.22B |
| SE-ResNet-101 | \((N,3,224,224)\) | 49,326,872 | 8.00B |
| SE-ResNet-152 | \((N,3,224,224)\) | 66,821,848 | 11.80B |
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| SE-ResNeXt-50-32x4d | \((N,3,224,224)\) | 27,559,896 | 4.40B |
| SE-ResNeXt-101-32x4d | \((N,3,224,224)\) | 48,955,416 | 8.22B |
| SE-ResNeXt-101-32x8d | \((N,3,224,224)\) | 93,569,048 | 16.77B |
| SE-ResNeXt-101-64x4d | \((N,3,224,224)\) | 88,232,984 | 15.81B |
SKNet¶
ConvNet Image Classification
SKNet (Selective Kernel Networks) is a deep learning architecture that enhances representational capacity by letting the network choose among convolutional kernels of different sizes. Its “Selective Kernel” unit runs parallel branches with different receptive fields and fuses them with soft, channel-wise attention, allowing the network to adaptively adjust its effective receptive field and better capture multi-scale features.
Li, Xiang, et al. “Selective Kernel Networks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 510-519.
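A selective kernel unit runs the input through branches with different kernel sizes, fuses them, and applies a softmax across branches (per channel) to weight each branch adaptively. A simplified two-branch sketch (the `SKUnit` class is illustrative; the paper's grouped convolutions and exact widths are omitted):

```python
import torch
import torch.nn as nn

class SKUnit(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Two branches with different receptive fields (3x3 and 5x5).
        self.branch3 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                                     nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.branch5 = nn.Sequential(nn.Conv2d(channels, channels, 5, padding=2, bias=False),
                                     nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        mid = max(channels // reduction, 8)
        self.fc = nn.Linear(channels, mid)
        self.attn = nn.Linear(mid, channels * 2)  # one score per channel per branch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        u3, u5 = self.branch3(x), self.branch5(x)
        s = (u3 + u5).mean(dim=(2, 3))                       # fuse + squeeze -> (N, C)
        a = self.attn(torch.relu(self.fc(s)))                # (N, 2C)
        a = a.view(x.size(0), 2, x.size(1)).softmax(dim=1)   # softmax over the two branches
        w3 = a[:, 0].unsqueeze(-1).unsqueeze(-1)
        w5 = a[:, 1].unsqueeze(-1).unsqueeze(-1)
        return u3 * w3 + u5 * w5                             # select between kernel sizes

print(SKUnit(64)(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```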
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| SK-ResNet-18 | \((N,3,224,224)\) | 25,647,368 | 3.92B |
| SK-ResNet-34 | \((N,3,224,224)\) | 45,895,512 | 7.64B |
| SK-ResNet-50 | \((N,3,224,224)\) | 57,073,368 | 9.35B |
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| SK-ResNeXt-50-32x4d | \((N,3,224,224)\) | 29,274,760 | 5.04B |
DenseNet¶
ConvNet Image Classification
A deep learning architecture designed to improve the flow of information and gradients in neural networks by introducing dense connectivity between layers. It leverages the concept of “dense blocks,” where each layer is directly connected to all preceding layers within the block. This dense connectivity pattern enhances feature reuse, reduces the number of parameters, and improves the efficiency of gradient propagation during training.
Huang, Gao, et al. “Densely Connected Convolutional Networks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4700-4708.
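Inside a dense block, each layer receives the concatenation of all preceding feature maps and contributes a fixed number of new channels (the growth rate). A minimal sketch (the `DenseLayer`/`DenseBlock` classes are illustrative and omit the bottleneck and transition layers of the full model):

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """BN-ReLU-Conv layer that appends `growth_rate` new channels to its input."""
    def __init__(self, in_ch: int, growth_rate: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, growth_rate, 3, padding=1, bias=False),
        )

    def forward(self, x):
        return torch.cat([x, self.net(x)], dim=1)  # dense connectivity via concatenation

class DenseBlock(nn.Sequential):
    def __init__(self, in_ch: int, num_layers: int, growth_rate: int = 32):
        layers = [DenseLayer(in_ch + i * growth_rate, growth_rate) for i in range(num_layers)]
        super().__init__(*layers)

block = DenseBlock(in_ch=64, num_layers=6)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 256, 56, 56]): 64 + 6*32
```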
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| DenseNet-121 | \((N,3,224,224)\) | 7,978,856 | 2.99B |
| DenseNet-169 | \((N,3,224,224)\) | 14,149,480 | 3.55B |
| DenseNet-201 | \((N,3,224,224)\) | 20,013,928 | 4.54B |
| DenseNet-264 | \((N,3,224,224)\) | 33,337,704 | 6.09B |
Xception¶
ConvNet Image Classification
A deep learning architecture that introduces depthwise separable convolutions to enhance efficiency and accuracy in convolutional neural networks. It builds on the idea that spatial and channel-wise information can be decoupled, significantly reducing computational cost while maintaining performance.
Chollet, François. “Xception: Deep Learning with Depthwise Separable Convolutions.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1251-1258.
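The core operation is the depthwise separable convolution: a depthwise convolution filters each channel independently, then a 1x1 pointwise convolution mixes channels. A minimal sketch with a rough parameter comparison against a standard convolution (the `SeparableConv2d` class is illustrative):

```python
import torch
import torch.nn as nn

class SeparableConv2d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        # Depthwise: groups=in_ch means one filter per input channel (spatial filtering only).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch, bias=False)
        # Pointwise: 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

sep = SeparableConv2d(128, 256)
std = nn.Conv2d(128, 256, 3, padding=1, bias=False)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(sep), count(std))  # 33920 vs. 294912 parameters for the same I/O shape
```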
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| Xception | \((N,3,224,224)\) | 22,862,096 | 4.67B |
MobileNet¶
ConvNet Image Classification
MobileNet is a family of lightweight convolutional neural networks designed for mobile and embedded vision applications. The original model builds on depthwise separable convolutions and introduces width and resolution multipliers to trade a small amount of accuracy for large savings in parameters and computation; later versions add inverted residual blocks with linear bottlenecks (v2), architecture search with squeeze-and-excitation and hard-swish activations (v3), and further efficiency-oriented refinements (v4).
MobileNet
Howard, Andrew G., et al. “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications.” arXiv preprint arXiv:1704.04861, 2017.
MobileNet-v2
Sandler, Mark, et al. “MobileNetV2: Inverted Residuals and Linear Bottlenecks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4510-4520.
MobileNet-v3
Howard, Andrew, et al. “Searching for MobileNetV3.” Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 1314-1324.
MobileNet-v4
Qin, Danfeng, et al. “MobileNetV4: Universal Models for the Mobile Ecosystem.” European Conference on Computer Vision (ECCV), 2024.
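MobileNet-v2's signature block is the inverted residual with a linear bottleneck: a 1x1 convolution expands the channels, a depthwise 3x3 convolution filters them, and a 1x1 projection reduces them again with no non-linearity, plus a skip connection when shapes match. A minimal sketch (the `InvertedResidual` class is illustrative; expansion factor 6 follows the paper):

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1, expand: int = 6):
        super().__init__()
        hidden = in_ch * expand
        self.use_skip = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # Linear bottleneck: no activation after the projection back down.
            nn.Conv2d(hidden, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_skip else y

print(InvertedResidual(32, 32)(torch.randn(1, 32, 112, 112)).shape)             # skip connection used
print(InvertedResidual(32, 64, stride=2)(torch.randn(1, 32, 112, 112)).shape)   # downsampling block
```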
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| MobileNet | \((N,3,224,224)\) | 4,232,008 | 584.08M |
| MobileNet-v2 | \((N,3,224,224)\) | 3,504,872 | 367.39M |
| MobileNet-v3-Small | \((N,3,224,224)\) | 2,537,238 | 73.88M |
| MobileNet-v3-Large | \((N,3,224,224)\) | 5,481,198 | 266.91M |
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| MobileNet-v4-Conv-Small | \((N,3,224,224)\) | 3,774,024 | 265.15M |
| MobileNet-v4-Conv-Medium | \((N,3,224,224)\) | 9,715,512 | 944.48M |
| MobileNet-v4-Conv-Large | \((N,3,224,224)\) | 32,590,864 | 2.32B |
| MobileNet-v4-Hybrid-Medium | \((N,3,224,224)\) | 11,070,136 | 1.09B |
| MobileNet-v4-Hybrid-Large | \((N,3,224,224)\) | 37,755,152 | 2.72B |
EfficientNet¶
ConvNet Image Classification
EfficientNet is a family of convolutional neural networks optimized for scalability and performance by systematically balancing network depth, width, and resolution. It achieves state-of-the-art accuracy with fewer parameters and computational resources compared to previous architectures.
EfficientNet
Tan, Mingxing, and Quoc V. Le. “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.” Proceedings of the 36th International Conference on Machine Learning, 2019, pp. 6105-6114.
EfficientNet-v2
Tan, Mingxing, and Quoc V. Le. “EfficientNetV2: Smaller Models and Faster Training.” Proceedings of the 38th International Conference on Machine Learning, 2021, pp. 10096-10106.
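Compound scaling chooses one coefficient φ and scales depth, width, and resolution together as α^φ, β^φ, γ^φ, under the constraint α·β²·γ² ≈ 2 so that each increment of φ roughly doubles the FLOPs. A small worked sketch using the coefficients reported in the paper (α = 1.2, β = 1.1, γ = 1.15):

```python
# Compound scaling from EfficientNet: depth, width and resolution grow together.
alpha, beta, gamma = 1.2, 1.1, 1.15  # found by grid search in the original paper

def compound_scaling(phi):
    depth = alpha ** phi       # multiplier on the number of layers
    width = beta ** phi        # multiplier on the number of channels
    resolution = gamma ** phi  # multiplier on the input resolution
    return depth, width, resolution

for phi in range(4):
    d, w, r = compound_scaling(phi)
    # FLOPs scale roughly as d * w^2 * r^2, i.e. about (alpha * beta**2 * gamma**2) ** phi ≈ 2**phi.
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```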
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| EfficientNet-B0 | \((N,3,224,224)\) | 5,289,636 | 463.32M |
| EfficientNet-B1 | \((N,3,240,240)\) | 7,795,560 | 849.06M |
| EfficientNet-B2 | \((N,3,260,260)\) | 9,111,370 | 1.20B |
| EfficientNet-B3 | \((N,3,300,300)\) | 12,235,536 | 2.01B |
| EfficientNet-B4 | \((N,3,380,380)\) | 19,344,640 | 4.63B |
| EfficientNet-B5 | \((N,3,456,456)\) | 30,393,432 | 12.17B |
| EfficientNet-B6 | \((N,3,528,528)\) | 43,046,128 | 21.34B |
| EfficientNet-B7 | \((N,3,600,600)\) | 66,355,448 | 40.31B |
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| EfficientNet-v2-S | \((N,3,224,224)\) | 21,136,440 | 789.91M |
| EfficientNet-v2-M | \((N,3,224,224)\) | 55,302,108 | 1.42B |
| EfficientNet-v2-L | \((N,3,224,224)\) | 120,617,032 | 3.17B |
| EfficientNet-v2-XL | \((N,3,224,224)\) | 210,221,568 | 4.12B |
ResNeSt¶
ConvNet Image Classification
ResNeSt introduces Split Attention Blocks, which divide feature maps into groups, compute attention for each group, and reassemble them to enhance representational power. It extends ResNet by integrating these blocks, achieving improved performance in image recognition tasks with minimal computational overhead.
Zhang, Hang, et al. “ResNeSt: Split-Attention Networks.” arXiv preprint arXiv:2004.08955, 2020.
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| ResNeSt-14 | \((N,3,224,224)\) | 10,611,560 | 2.82B |
| ResNeSt-26 | \((N,3,224,224)\) | 17,069,320 | 3.72B |
| ResNeSt-50 | \((N,3,224,224)\) | 27,483,112 | 5.52B |
| ResNeSt-101 | \((N,3,224,224)\) | 48,274,760 | 10.43B |
| ResNeSt-200 | \((N,3,224,224)\) | 70,201,288 | 17.85B |
| ResNeSt-269 | \((N,3,224,224)\) | 110,929,224 | 22.98B |
| ResNeSt-50-4s2x40d | \((N,3,224,224)\) | 30,417,464 | 5.41B |
| ResNeSt-50-1s4x24d | \((N,3,224,224)\) | 25,676,872 | 5.14B |
ConvNeXt¶
ConvNet Image Classification
ConvNeXt reimagines CNNs using principles inspired by vision transformers, streamlining architectural design while preserving the efficiency of traditional CNNs. It introduces design elements like simplified stem stages, inverted bottlenecks, and expanded kernel sizes to enhance feature extraction.
ConvNeXt
Liu, Zhuang, et al. “A ConvNet for the 2020s.” arXiv preprint arXiv:2201.03545, 2022.
ConvNeXt-v2
Woo, Sanghyun, et al. “ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders.” arXiv preprint arXiv:2301.00808, 2023.
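A ConvNeXt block applies a large-kernel depthwise convolution, LayerNorm, and an inverted-bottleneck MLP over channels, wrapped in a residual connection. A minimal sketch (the `ConvNeXtBlock` class is illustrative; layer scale and stochastic depth are omitted):

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # large-kernel depthwise
        self.norm = nn.LayerNorm(dim)            # applied over the channel dimension
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # inverted bottleneck: expand 4x
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)   # project back

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # (N, C, H, W) -> (N, H, W, C) for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)
        return residual + x

print(ConvNeXtBlock(96)(torch.randn(1, 96, 56, 56)).shape)  # torch.Size([1, 96, 56, 56])
```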
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| ConvNeXt-Tiny | \((N,3,224,224)\) | 28,589,128 | 4.73B |
| ConvNeXt-Small | \((N,3,224,224)\) | 46,884,148 | 8.46B |
| ConvNeXt-Base | \((N,3,224,224)\) | 88,591,464 | 15.93B |
| ConvNeXt-Large | \((N,3,224,224)\) | 197,767,336 | 35.23B |
| ConvNeXt-XLarge | \((N,3,224,224)\) | 350,196,968 | 62.08B |
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| ConvNeXt-v2-Atto | \((N,3,224,224)\) | 3,708,400 | 641.87M |
| ConvNeXt-v2-Femto | \((N,3,224,224)\) | 5,233,240 | 893.05M |
| ConvNeXt-v2-Pico | \((N,3,224,224)\) | 9,066,280 | 1.52B |
| ConvNeXt-v2-Nano | \((N,3,224,224)\) | 15,623,800 | 2.65B |
| ConvNeXt-v2-Tiny | \((N,3,224,224)\) | 28,635,496 | 4.79B |
| ConvNeXt-v2-Base | \((N,3,224,224)\) | 88,717,800 | 16.08B |
| ConvNeXt-v2-Large | \((N,3,224,224)\) | 197,956,840 | 35.64B |
| ConvNeXt-v2-Huge | \((N,3,224,224)\) | 660,289,640 | 120.89B |
InceptionNeXt¶
ConvNet Image Classification
InceptionNeXt combines the Inception idea of parallel branches with the modernized ConvNeXt block design. It decomposes the large-kernel depthwise convolution into parallel branches (a small square kernel, two orthogonal band kernels, and an identity mapping), improving training and inference throughput while preserving accuracy across diverse vision tasks.
Yu, Weihao, et al. “InceptionNeXt: When Inception Meets ConvNeXt.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 5672-5683.
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| InceptionNeXt-Atto | \((N,3,224,224)\) | 4,156,520 | 582.25M |
| InceptionNeXt-Tiny | \((N,3,224,224)\) | 28,083,832 | 4.48B |
| InceptionNeXt-Small | \((N,3,224,224)\) | 49,431,544 | 8.82B |
| InceptionNeXt-Base | \((N,3,224,224)\) | 86,748,840 | 15.47B |
CoAtNet¶
ConvNet Image Classification
CoAtNet extends the hybrid architecture paradigm by integrating convolutional and transformer-based designs. It enhances representation learning through hierarchical feature extraction, leveraging early-stage depthwise convolutions for locality and later-stage self-attention for global context. With relative position encoding, pre-normalization, and an optimized scaling strategy, CoAtNet achieves superior efficiency and performance across various vision tasks.
Dai, Zihang, et al. “CoAtNet: Marrying Convolution and Attention for All Data Sizes.” Advances in Neural Information Processing Systems, 2021, pp. 3965-3977.
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| CoAtNet-0 | \((N,3,224,224)\) | 27,174,944 | 5.52B |
| CoAtNet-1 | \((N,3,224,224)\) | 53,330,240 | 12.32B |
| CoAtNet-2 | \((N,3,224,224)\) | 82,516,096 | 19.72B |
| CoAtNet-3 | \((N,3,224,224)\) | 157,790,656 | 37.17B |
| CoAtNet-4 | \((N,3,224,224)\) | 277,301,632 | 66.79B |
| CoAtNet-5 | \((N,3,224,224)\) | 770,124,608 | 189.34B |
| CoAtNet-6 | \((N,3,224,224)\) | 2,011,558,336 | 293.51B |
| CoAtNet-7 | \((N,3,224,224)\) | 3,107,978,688 | 364.71B |
Visual Transformer (ViT)¶
Transformer Vision Transformer Image Classification
The Vision Transformer (ViT) is a deep learning architecture introduced by Dosovitskiy et al. in 2020, designed for image recognition tasks using self-attention mechanisms. Unlike traditional convolutional neural networks (CNNs), ViT splits an image into fixed-size patches, processes them as a sequence, and applies Transformer layers to capture global dependencies.
Dosovitskiy, Alexey, et al. “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.” International Conference on Learning Representations (ICLR), 2021.
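A minimal sketch of the ViT pipeline: split the image into patches with a strided convolution, prepend a class token, add positional embeddings, and run a Transformer encoder. The `TinyViT` class below is illustrative (dimensions roughly follow ViT-Ti) and uses PyTorch's built-in encoder rather than the original implementation:

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, image_size=224, patch=16, dim=192, depth=12, heads=3, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch) ** 2
        # Patch embedding: a conv with kernel = stride = patch size maps each 16x16 patch to a vector.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True, activation="gelu")
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.patch_embed(x).flatten(2).transpose(1, 2)   # (N, 196, dim) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed      # prepend [CLS], add positions
        x = self.encoder(x)                                   # global self-attention over all patches
        return self.head(x[:, 0])                             # classify from the [CLS] token

print(TinyViT()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```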
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| ViT-Ti | \((N,3,224,224)\) | 5,717,416 | 1.36B |
| ViT-S | \((N,3,224,224)\) | 22,050,664 | 4.81B |
| ViT-B | \((N,3,224,224)\) | 86,567,656 | 17.99B |
| ViT-L | \((N,3,224,224)\) | 304,326,632 | 62.69B |
| ViT-H | \((N,3,224,224)\) | 632,199,400 | 169.45B |
Swin Transformer¶
Transformer Vision Transformer Image Classification
The Swin Transformer is a hierarchical vision transformer introduced by Liu et al. in 2021, designed for image recognition and dense prediction tasks using self-attention mechanisms within shifted local windows. Unlike traditional convolutional neural networks (CNNs) and the original Vision Transformer (ViT)—which splits an image into fixed-size patches and processes them as a flat sequence—the Swin Transformer divides the image into non-overlapping local windows and computes self-attention within each window.
Swin Transformer
Liu, Ze, et al. “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows.” arXiv preprint arXiv:2103.14030, 2021.
Swin Transformer-v2
Liu, Ze, et al. “Swin Transformer V2: Scaling Up Capacity and Resolution.” arXiv preprint arXiv:2111.09883, 2021.
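The key operation is partitioning the feature map into non-overlapping windows so self-attention is computed within each window; shifting the window grid between consecutive blocks lets information flow across window boundaries. A minimal sketch of just the partition/shift bookkeeping (the helper functions are illustrative; attention itself is omitted):

```python
import torch

def window_partition(x: torch.Tensor, window: int) -> torch.Tensor:
    """(N, H, W, C) -> (num_windows * N, window*window, C) non-overlapping windows."""
    n, h, w, c = x.shape
    x = x.view(n, h // window, window, w // window, window, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, c)

def window_reverse(windows: torch.Tensor, window: int, h: int, w: int) -> torch.Tensor:
    """Inverse of window_partition."""
    n = windows.shape[0] // ((h // window) * (w // window))
    x = windows.view(n, h // window, w // window, window, window, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(n, h, w, -1)

feat = torch.randn(1, 56, 56, 96)                           # stage-1 feature map of Swin-T
shifted = torch.roll(feat, shifts=(-3, -3), dims=(1, 2))    # the "shifted window" variant
wins = window_partition(shifted, window=7)
print(wins.shape)                                           # torch.Size([64, 49, 96]): 8x8 windows of 7x7 tokens
restored = torch.roll(window_reverse(wins, 7, 56, 56), shifts=(3, 3), dims=(1, 2))
print(torch.allclose(restored, feat))                       # True
```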
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| Swin-T | \((N,3,224,224)\) | 28,288,354 | 4.95B |
| Swin-S | \((N,3,224,224)\) | 49,606,258 | 9.37B |
| Swin-B | \((N,3,224,224)\) | 87,768,224 | 16.35B |
| Swin-L | \((N,3,224,224)\) | 196,532,476 | 36.08B |
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| Swin-v2-T | \((N,3,224,224)\) | 28,349,842 | 5.01B |
| Swin-v2-S | \((N,3,224,224)\) | 49,731,106 | 9.48B |
| Swin-v2-B | \((N,3,224,224)\) | 87,922,400 | 16.49B |
| Swin-v2-L | \((N,3,224,224)\) | 196,745,308 | 36.29B |
| Swin-v2-H | \((N,3,224,224)\) | 657,796,668 | 119.42B |
| Swin-v2-G | \((N,3,224,224)\) | 3,000,869,564 | 531.67B |
Convolutional Transformer (CvT)¶
Transformer Vision Transformer Image Classification
CvT (Convolutional Vision Transformer) combines self-attention with depthwise convolutions to improve local feature extraction and computational efficiency. This hybrid design retains the global modeling capabilities of Vision Transformers while enhancing inductive biases, making it effective for image classification and dense prediction tasks.
Wu, Haiping, et al. “CvT: Introducing Convolutions to Vision Transformers.” Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 22-31.
Pyramid Vision Transformer (PVT)¶
Transformer Vision Transformer Image Classification
The Pyramid Vision Transformer (PVT) combines CNN-like pyramidal structures with Transformer attention, capturing multi-scale features efficiently. It reduces spatial resolution progressively and uses spatial-reduction attention (SRA) to enhance performance in dense prediction tasks like detection and segmentation.
PVT
Wang, Wenhai, et al. “Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions.” arXiv preprint arXiv:2102.12122, 2021.
PVT-v2
Wang, Wenhai, et al. “PVT v2: Improved Baselines with Pyramid Vision Transformer.” Computational Visual Media, vol. 8, no. 3, 2022, pp. 415-424.
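Spatial-reduction attention keeps full-resolution queries but downsamples the keys and values with a strided convolution before attention, cutting the quadratic cost on large feature maps. A simplified sketch built on PyTorch's `nn.MultiheadAttention` (the `SpatialReductionAttention` class is illustrative):

```python
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    def __init__(self, dim: int, heads: int, sr_ratio: int):
        super().__init__()
        # Keys/values are computed on a spatially reduced copy of the feature map.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (N, H*W, C) token sequence for an h x w feature map
        n, _, c = x.shape
        kv = x.transpose(1, 2).reshape(n, c, h, w)
        kv = self.sr(kv).flatten(2).transpose(1, 2)    # (N, (H/r)*(W/r), C) reduced tokens
        kv = self.norm(kv)
        out, _ = self.attn(query=x, key=kv, value=kv)  # queries stay at full resolution
        return out

sra = SpatialReductionAttention(dim=64, heads=1, sr_ratio=8)
tokens = torch.randn(1, 56 * 56, 64)                   # stage-1 tokens of PVT
print(sra(tokens, 56, 56).shape)                       # torch.Size([1, 3136, 64])
```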
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| PVT-Tiny | \((N,3,224,224)\) | 12,457,192 | 2.02B |
| PVT-Small | \((N,3,224,224)\) | 23,003,048 | 3.93B |
| PVT-Medium | \((N,3,224,224)\) | 41,492,648 | 6.66B |
| PVT-Large | \((N,3,224,224)\) | 55,359,848 | 8.71B |
| PVT-Huge | \((N,3,224,224)\) | 286,706,920 | 48.63B |
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| PVT-v2-B0 | \((N,3,224,224)\) | 3,666,760 | 677.67M |
| PVT-v2-B1 | \((N,3,224,224)\) | 14,009,000 | 2.32B |
| PVT-v2-B2 | \((N,3,224,224)\) | 25,362,856 | 4.39B |
| PVT-v2-B2-Linear | \((N,3,224,224)\) | 22,553,512 | 4.27B |
| PVT-v2-B3 | \((N,3,224,224)\) | 45,238,696 | 7.39B |
| PVT-v2-B4 | \((N,3,224,224)\) | 62,556,072 | 10.80B |
| PVT-v2-B5 | \((N,3,224,224)\) | 82,882,984 | 13.47B |
CrossViT¶
Transformer Vision Transformer Image Classification
CrossViT is a vision transformer architecture that combines multi-scale tokenization by processing input images at different resolutions in parallel, enabling it to capture both fine-grained and coarse-grained visual features. It uses a novel cross-attention mechanism to fuse information across these scales, improving performance on image recognition tasks.
Chen, Chun-Fu, Quanfu Fan, and Rameswar Panda. “CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification.” arXiv preprint arXiv:2103.14899, 2021.
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| CrossViT-Ti | \((N,3,224,224)\) | 7,014,800 | 1.73B |
| CrossViT-S | \((N,3,224,224)\) | 26,856,272 | 5.94B |
| CrossViT-B | \((N,3,224,224)\) | 105,025,232 | 21.85B |
| CrossViT-9 | \((N,3,224,224)\) | 8,553,296 | 2.01B |
| CrossViT-15 | \((N,3,224,224)\) | 27,528,464 | 6.13B |
| CrossViT-18 | \((N,3,224,224)\) | 43,271,408 | 9.48B |
| CrossViT-9† | \((N,3,224,224)\) | 8,776,592 | 2.15B |
| CrossViT-15† | \((N,3,224,224)\) | 28,209,008 | 6.45B |
| CrossViT-18† | \((N,3,224,224)\) | 44,266,976 | 9.93B |
MaxViT¶
Transformer Vision Transformer Image Classification
MaxViT is a hybrid vision architecture that combines convolution, windowed attention, and grid-based attention in a multi-axis design. This hierarchical structure enables MaxViT to efficiently capture both local and global dependencies, making it effective for various vision tasks with high accuracy and scalability.
Tu, Zhengzhong, et al. “MaxViT: Multi-Axis Vision Transformer.” arXiv preprint arXiv:2204.01697, 2022.
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| MaxViT-T | \((N,3,224,224)\) | 25,081,416 | 5.60B |
| MaxViT-S | \((N,3,224,224)\) | 55,757,304 | 10.59B |
| MaxViT-B | \((N,3,224,224)\) | 96,626,776 | 21.83B |
| MaxViT-L | \((N,3,224,224)\) | 171,187,880 | 38.51B |
| MaxViT-XL | \((N,3,224,224)\) | 383,734,024 | 83.74B |
To be implemented…🔮