Image Classification¶
LeNet¶
ConvNet Image Classification
LeNet is a pioneering CNN by Yann LeCun for digit recognition, combining convolutional, pooling, and fully connected layers. It introduced concepts like weight sharing and local receptive fields, shaping modern CNNs.
LeCun, Yann, et al. “Gradient-Based Learning Applied to Document Recognition.” Proceedings of the IEEE, vol. 86, no. 11, Nov. 1998, pp. 2278-2324. doi:10.1109/5.726791.
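The layer layout is easy to make concrete. Below is a minimal PyTorch sketch in the spirit of LeNet-5 (the `LeNet` class is illustrative, not the exact 1998 network; pooling and activation details are simplified):

```python
import torch
import torch.nn as nn

class LeNet(nn.Module):
    """Two conv+pool stages followed by fully connected layers, for 32x32 grayscale digits."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 32x32 -> 28x28
            nn.Tanh(),
            nn.AvgPool2d(2),                  # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),  # 14x14 -> 10x10
            nn.Tanh(),
            nn.AvgPool2d(2),                  # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.Tanh(),
            nn.Linear(120, 84),
            nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

logits = LeNet()(torch.randn(1, 1, 32, 32))
print(logits.shape)  # torch.Size([1, 10])
```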
AlexNet¶
ConvNet Image Classification
AlexNet is a pioneering convolutional neural network introduced in 2012, known for its deep architecture and use of ReLU activations, dropout, and GPU acceleration. It achieved groundbreaking performance in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012, popularizing deep learning for computer vision.
Krizhevsky, Alex, et al. “ImageNet Classification with Deep Convolutional Neural Networks.” Advances in Neural Information Processing Systems, vol. 25, 2012, pp. 1097-1105.
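The parameter counts in the tables below can be reproduced by instantiating a reference implementation and summing parameter sizes. As an example, assuming the `torchvision` package is available, its AlexNet variant matches the figure in the table:

```python
import torch
from torchvision.models import alexnet

model = alexnet()  # randomly initialized weights are enough for counting
n_params = sum(p.numel() for p in model.parameters())
print(f"AlexNet parameters: {n_params:,}")  # 61,100,840 for the torchvision variant

# A forward pass with the table's input shape (N=1) confirms the expected output shape.
out = model(torch.randn(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 1000])
```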
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| AlexNet | \((N,3,224,224)\) | 61,100,840 | 715.21M |
ZFNet¶
ConvNet Image Classification
ZFNet (Zeiler and Fergus Net) is a convolutional neural network that improved upon AlexNet by using smaller convolutional filters and visualizing learned features to better understand network behavior. It achieved state-of-the-art results in object recognition and provided insights into deep learning interpretability.
Zeiler, Matthew D., and Rob Fergus. “Visualizing and Understanding Convolutional Networks.” European Conference on Computer Vision (ECCV), Springer, Cham, 2014, pp. 818-833. doi:10.1007/978-3-319-10590-1_53.
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| ZFNet | \((N,3,224,224)\) | 62,357,608 | 1.20B |
VGGNet¶
ConvNet Image Classification
VGGNet is a deep convolutional neural network known for its simplicity and use of small 3x3 convolutional filters, which significantly improved object recognition accuracy.
Simonyan, Karen, and Andrew Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition.” arXiv preprint arXiv:1409.1556, 2014.
Inception¶
ConvNet Image Classification
The Inception architecture, introduced in the GoogLeNet model, is a deep convolutional neural network designed for efficient feature extraction using parallel convolutional and pooling branches, reducing computational cost. It achieves this by combining multi-scale feature processing within each module, making it highly effective for image classification tasks.
Szegedy, Christian, et al. “Going Deeper with Convolutions.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1-9.
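The defining building block is a module whose parallel branches (1x1, 3x3, 5x5 convolutions and pooling) are concatenated along the channel axis. A minimal sketch of such a module follows; the `InceptionModule` class is illustrative, with branch widths matching the paper's first module:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 / pooling branches, concatenated on the channel axis."""
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, c1, 1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c3_red, c3, 3, padding=1),
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c5_red, c5, 5, padding=2),
        )
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1),
        )

    def forward(self, x):
        # All branches preserve spatial size, so their outputs can be concatenated.
        return torch.cat(
            [self.branch1(x), self.branch3(x), self.branch5(x), self.branch_pool(x)],
            dim=1,
        )

y = InceptionModule(192, 64, 96, 128, 16, 32, 32)(torch.randn(1, 192, 28, 28))
print(y.shape)  # torch.Size([1, 256, 28, 28])
```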
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| Inception-v1 (GoogLeNet) | \((N,3,224,224)\) | 13,393,352 | 1.62B |
| Inception-v3 | \((N,3,299,299)\) | 30,817,392 | 3.20B |
| Inception-v4 | \((N,3,299,299)\) | 40,586,984 | 5.75B |
Inception-ResNet¶
ConvNet Image Classification
The Inception-ResNet architecture builds upon the Inception model by integrating residual connections, which improve gradient flow and training stability in very deep networks. This combination of Inception’s multi-scale feature processing with ResNet’s efficient backpropagation allows for a powerful and scalable design, suitable for a wide range of image classification tasks.
Szegedy, Christian, et al. “Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning.” Proceedings of the AAAI Conference on Artificial Intelligence, 2017, pp. 4278-4284.
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| Inception-ResNet-v1 | \((N,3,299,299)\) | 22,739,128 | 3.16B |
| Inception-ResNet-v2 | \((N,3,299,299)\) | 35,847,512 | 4.54B |
ResNet¶
ConvNet Image Classification
ResNets (Residual Networks) are deep neural network architectures that use skip connections (residual connections) to alleviate the vanishing gradient problem, enabling the training of extremely deep models. They revolutionized deep learning by introducing identity mappings, allowing efficient backpropagation and improved accuracy in tasks like image classification and object detection.
He, Kaiming, et al. “Deep Residual Learning for Image Recognition.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778.
He, Kaiming, et al. “Identity Mappings in Deep Residual Networks.” European Conference on Computer Vision (ECCV), Springer, 2016, pp. 630-645.
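The central idea is the residual block: a short stack of convolutions whose output is added back to the block's input through an identity (or projection) shortcut. A minimal sketch of the two-layer basic block used in ResNet-18/34 (the `BasicBlock` class here is illustrative, not a library implementation):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # Projection shortcut when the shape changes; identity otherwise.
        self.shortcut = (
            nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                          nn.BatchNorm2d(out_ch))
            if stride != 1 or in_ch != out_ch else nn.Identity()
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))  # the residual connection

print(BasicBlock(64, 128, stride=2)(torch.randn(1, 64, 56, 56)).shape)  # (1, 128, 28, 28)
```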
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| ResNet-18 | \((N,3,224,224)\) | 11,689,512 | 1.84B |
| ResNet-34 | \((N,3,224,224)\) | 21,797,672 | 3.70B |
| ResNet-50 | \((N,3,224,224)\) | 25,557,032 | 4.20B |
| ResNet-101 | \((N,3,224,224)\) | 44,549,160 | 7.97B |
| ResNet-152 | \((N,3,224,224)\) | 60,192,808 | 11.75B |
| ResNet-200 | \((N,3,224,224)\) | 64,669,864 | 15.35B |
| ResNet-269 | \((N,3,224,224)\) | 102,069,416 | 20.46B |
| ResNet-1001 | \((N,3,224,224)\) | 149,071,016 | 43.94B |
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| WideResNet-50 | \((N,3,224,224)\) | 78,973,224 | 11.55B |
| WideResNet-101 | \((N,3,224,224)\) | 126,886,696 | 22.97B |
ResNeXt¶
ConvNet Image Classification
ResNeXt is an extension of the ResNet architecture that introduces a cardinality dimension to the model, improving its performance and efficiency by allowing flexible aggregation of transformations. ResNeXt builds on residual blocks by incorporating grouped convolutions, enabling parallel pathways for feature learning.
Xie, Saining, et al. “Aggregated Residual Transformations for Deep Neural Networks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5987-5995.
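Cardinality is realized in practice with grouped convolutions: the 3x3 convolution inside a bottleneck block is split into C parallel groups. A minimal sketch with cardinality 32 and bottleneck width 4, i.e. the “32x4d” setting (the `ResNeXtBottleneck` class and its channel bookkeeping are illustrative):

```python
import torch
import torch.nn as nn

class ResNeXtBottleneck(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, cardinality: int = 32, base_width: int = 4):
        super().__init__()
        width = cardinality * base_width  # e.g. 32 * 4 = 128 internal channels
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, width, 1, bias=False), nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            # groups=cardinality splits this conv into 32 parallel transformations
            nn.Conv2d(width, width, 3, padding=1, groups=cardinality, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )
        self.shortcut = nn.Identity() if in_ch == out_ch else nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.block(x) + self.shortcut(x))

print(ResNeXtBottleneck(256, 256)(torch.randn(1, 256, 56, 56)).shape)  # (1, 256, 56, 56)
```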
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| ResNeXt-50-32x4d | \((N,3,224,224)\) | 25,028,904 | 4.38B |
| ResNeXt-101-32x4d | \((N,3,224,224)\) | 44,177,704 | 8.19B |
| ResNeXt-101-32x8d | \((N,3,224,224)\) | 88,791,336 | 16.73B |
| ResNeXt-101-32x16d | \((N,3,224,224)\) | 194,026,792 | 36.68B |
| ResNeXt-101-32x32d | \((N,3,224,224)\) | 468,530,472 | 88.03B |
| ResNeXt-101-64x4d | \((N,3,224,224)\) | 83,455,272 | 15.78B |
SENet¶
ConvNet Image Classification
SENets (Squeeze-and-Excitation Networks) are deep neural network architectures that enhance the representational power of models by explicitly modeling channel interdependencies. They introduce a novel “squeeze-and-excitation” block, which adaptively recalibrates channel-wise feature responses.
Hu, Jie, et al. “Squeeze-and-Excitation Networks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7132-7141.
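The squeeze-and-excitation block itself is small: global average pooling “squeezes” each channel to a scalar, a two-layer bottleneck MLP produces per-channel weights, and the input feature map is rescaled channel-wise. A minimal sketch (the `SEBlock` class is illustrative):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))              # squeeze: global average pool -> (N, C)
        w = self.fc(s).view(n, c, 1, 1)     # excitation: per-channel weights in (0, 1)
        return x * w                        # recalibrate the feature map

print(SEBlock(64)(torch.randn(2, 64, 56, 56)).shape)  # torch.Size([2, 64, 56, 56])
```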
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| SE-ResNet-18 | \((N,3,224,224)\) | 11,778,592 | 1.84B |
| SE-ResNet-34 | \((N,3,224,224)\) | 21,958,868 | 3.71B |
| SE-ResNet-50 | \((N,3,224,224)\) | 28,088,024 | 4.22B |
| SE-ResNet-101 | \((N,3,224,224)\) | 49,326,872 | 8.00B |
| SE-ResNet-152 | \((N,3,224,224)\) | 66,821,848 | 11.80B |
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| SE-ResNeXt-50-32x4d | \((N,3,224,224)\) | 27,559,896 | 4.40B |
| SE-ResNeXt-101-32x4d | \((N,3,224,224)\) | 48,955,416 | 8.22B |
| SE-ResNeXt-101-32x8d | \((N,3,224,224)\) | 93,569,048 | 16.77B |
| SE-ResNeXt-101-64x4d | \((N,3,224,224)\) | 88,232,984 | 15.81B |
SKNet¶
ConvNet Image Classification
SKNet (Selective Kernel Networks) is a deep learning architecture that enhances representational capacity by letting the network choose among convolutional kernels of different sizes. Its “Selective Kernel” unit runs parallel branches with different receptive fields and fuses them with soft, channel-wise attention, allowing the network to adaptively adjust its effective receptive field and better capture multi-scale features.
Li, Xiang, et al. “Selective Kernel Networks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 510-519.
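A selective kernel unit runs the input through branches with different kernel sizes, fuses them, and applies a softmax across branches (per channel) to weight each branch adaptively. A simplified two-branch sketch (the `SKUnit` class is illustrative; the paper's grouped convolutions and exact widths are omitted):

```python
import torch
import torch.nn as nn

class SKUnit(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Two branches with different receptive fields (3x3 and 5x5).
        self.branch3 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                                     nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.branch5 = nn.Sequential(nn.Conv2d(channels, channels, 5, padding=2, bias=False),
                                     nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        mid = max(channels // reduction, 8)
        self.fc = nn.Linear(channels, mid)
        self.attn = nn.Linear(mid, channels * 2)  # one score per channel per branch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        u3, u5 = self.branch3(x), self.branch5(x)
        s = (u3 + u5).mean(dim=(2, 3))                       # fuse + squeeze -> (N, C)
        a = self.attn(torch.relu(self.fc(s)))                # (N, 2C)
        a = a.view(x.size(0), 2, x.size(1)).softmax(dim=1)   # softmax over the two branches
        w3 = a[:, 0].unsqueeze(-1).unsqueeze(-1)
        w5 = a[:, 1].unsqueeze(-1).unsqueeze(-1)
        return u3 * w3 + u5 * w5                             # select between kernel sizes

print(SKUnit(64)(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```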
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| SK-ResNet-18 | \((N,3,224,224)\) | 25,647,368 | 3.92B |
| SK-ResNet-34 | \((N,3,224,224)\) | 45,895,512 | 7.64B |
| SK-ResNet-50 | \((N,3,224,224)\) | 57,073,368 | 9.35B |
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| SK-ResNeXt-50-32x4d | \((N,3,224,224)\) | 29,274,760 | 5.04B |
DenseNet¶
ConvNet Image Classification
A deep learning architecture designed to improve the flow of information and gradients in neural networks by introducing dense connectivity between layers. It leverages the concept of “dense blocks,” where each layer is directly connected to all preceding layers within the block. This dense connectivity pattern enhances feature reuse, reduces the number of parameters, and improves the efficiency of gradient propagation during training.
Huang, Gao, et al. “Densely Connected Convolutional Networks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4700-4708.
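Inside a dense block, each layer receives the concatenation of all preceding feature maps and contributes a fixed number of new channels (the growth rate). A minimal sketch (the `DenseLayer`/`DenseBlock` classes are illustrative and omit the bottleneck and transition layers of the full model):

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """BN-ReLU-Conv layer that appends `growth_rate` new channels to its input."""
    def __init__(self, in_ch: int, growth_rate: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, growth_rate, 3, padding=1, bias=False),
        )

    def forward(self, x):
        return torch.cat([x, self.net(x)], dim=1)  # dense connectivity via concatenation

class DenseBlock(nn.Sequential):
    def __init__(self, in_ch: int, num_layers: int, growth_rate: int = 32):
        layers = [DenseLayer(in_ch + i * growth_rate, growth_rate) for i in range(num_layers)]
        super().__init__(*layers)

block = DenseBlock(in_ch=64, num_layers=6)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 256, 56, 56]): 64 + 6*32
```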
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| DenseNet-121 | \((N,3,224,224)\) | 7,978,856 | 2.99B |
| DenseNet-169 | \((N,3,224,224)\) | 14,149,480 | 3.55B |
| DenseNet-201 | \((N,3,224,224)\) | 20,013,928 | 4.54B |
| DenseNet-264 | \((N,3,224,224)\) | 33,337,704 | 6.09B |
Xception¶
ConvNet Image Classification
A deep learning architecture that introduces depthwise separable convolutions to enhance efficiency and accuracy in convolutional neural networks. It builds on the idea that spatial and channel-wise information can be decoupled, significantly reducing computational cost while maintaining performance.
Chollet, François. “Xception: Deep Learning with Depthwise Separable Convolutions.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1251-1258.
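The core operation is the depthwise separable convolution: a depthwise convolution filters each channel independently, then a 1x1 pointwise convolution mixes channels. A minimal sketch with a rough parameter comparison against a standard convolution (the `SeparableConv2d` class is illustrative):

```python
import torch
import torch.nn as nn

class SeparableConv2d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        # Depthwise: groups=in_ch means one filter per input channel (spatial filtering only).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch, bias=False)
        # Pointwise: 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

sep = SeparableConv2d(128, 256)
std = nn.Conv2d(128, 256, 3, padding=1, bias=False)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(sep), count(std))  # 33920 vs. 294912 parameters for the same I/O shape
```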
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| Xception | \((N,3,224,224)\) | 22,862,096 | 4.67B |
MobileNet¶
ConvNet Image Classification
MobileNet is a family of lightweight convolutional neural networks designed for mobile and embedded vision applications. The original model builds on depthwise separable convolutions and introduces width and resolution multipliers to trade a small amount of accuracy for large savings in parameters and computation; later versions add inverted residual blocks with linear bottlenecks (v2), architecture search with squeeze-and-excitation and hard-swish activations (v3), and further efficiency-oriented refinements (v4).
MobileNet
Howard, Andrew G., et al. “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications.” arXiv preprint arXiv:1704.04861, 2017.
MobileNet-v2
Sandler, Mark, et al. “MobileNetV2: Inverted Residuals and Linear Bottlenecks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4510-4520.
MobileNet-v3
Howard, Andrew, et al. “Searching for MobileNetV3.” Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 1314-1324.
MobileNet-v4
Qin, Danfeng, et al. “MobileNetV4: Universal Models for the Mobile Ecosystem.” European Conference on Computer Vision (ECCV), 2024.
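MobileNet-v2's signature block is the inverted residual with a linear bottleneck: a 1x1 convolution expands the channels, a depthwise 3x3 convolution filters them, and a 1x1 projection reduces them again with no non-linearity, plus a skip connection when shapes match. A minimal sketch (the `InvertedResidual` class is illustrative; expansion factor 6 follows the paper):

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1, expand: int = 6):
        super().__init__()
        hidden = in_ch * expand
        self.use_skip = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # Linear bottleneck: no activation after the projection back down.
            nn.Conv2d(hidden, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_skip else y

print(InvertedResidual(32, 32)(torch.randn(1, 32, 112, 112)).shape)             # skip connection used
print(InvertedResidual(32, 64, stride=2)(torch.randn(1, 32, 112, 112)).shape)   # downsampling block
```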
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| MobileNet | \((N,3,224,224)\) | 4,232,008 | 584.08M |
| MobileNet-v2 | \((N,3,224,224)\) | 3,504,872 | 367.39M |
| MobileNet-v3-Small | \((N,3,224,224)\) | 2,537,238 | 73.88M |
| MobileNet-v3-Large | \((N,3,224,224)\) | 5,481,198 | 266.91M |
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| MobileNet-v4-Conv-Small | \((N,3,224,224)\) | 3,774,024 | 265.15M |
| MobileNet-v4-Conv-Medium | \((N,3,224,224)\) | 9,715,512 | 944.48M |
| MobileNet-v4-Conv-Large | \((N,3,224,224)\) | 32,590,864 | 2.32B |
| MobileNet-v4-Hybrid-Medium | \((N,3,224,224)\) | 11,070,136 | 1.09B |
| MobileNet-v4-Hybrid-Large | \((N,3,224,224)\) | 37,755,152 | 2.72B |
EfficientNet¶
ConvNet Image Classification
EfficientNet is a family of convolutional neural networks optimized for scalability and performance by systematically balancing network depth, width, and resolution. It achieves state-of-the-art accuracy with fewer parameters and computational resources compared to previous architectures.
EfficientNet
Tan, Mingxing, and Quoc V. Le. “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.” Proceedings of the 36th International Conference on Machine Learning, 2019, pp. 6105-6114.
EfficientNet-v2
Tan, Mingxing, and Quoc V. Le. “EfficientNetV2: Smaller Models and Faster Training.” Proceedings of the 38th International Conference on Machine Learning, 2021, pp. 10096-10106.
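Compound scaling chooses one coefficient φ and scales depth, width, and resolution together as α^φ, β^φ, γ^φ, under the constraint α·β²·γ² ≈ 2 so that each increment of φ roughly doubles the FLOPs. A small worked sketch using the coefficients reported in the paper (α = 1.2, β = 1.1, γ = 1.15):

```python
# Compound scaling from EfficientNet: depth, width and resolution grow together.
alpha, beta, gamma = 1.2, 1.1, 1.15  # found by grid search in the original paper

def compound_scaling(phi):
    depth = alpha ** phi       # multiplier on the number of layers
    width = beta ** phi        # multiplier on the number of channels
    resolution = gamma ** phi  # multiplier on the input resolution
    return depth, width, resolution

for phi in range(4):
    d, w, r = compound_scaling(phi)
    # FLOPs scale roughly as d * w^2 * r^2, i.e. about (alpha * beta**2 * gamma**2) ** phi ≈ 2**phi.
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```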
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| EfficientNet-B0 | \((N,3,224,224)\) | 5,289,636 | 463.32M |
| EfficientNet-B1 | \((N,3,240,240)\) | 7,795,560 | 849.06M |
| EfficientNet-B2 | \((N,3,260,260)\) | 9,111,370 | 1.20B |
| EfficientNet-B3 | \((N,3,300,300)\) | 12,235,536 | 2.01B |
| EfficientNet-B4 | \((N,3,380,380)\) | 19,344,640 | 4.63B |
| EfficientNet-B5 | \((N,3,456,456)\) | 30,393,432 | 12.17B |
| EfficientNet-B6 | \((N,3,528,528)\) | 43,046,128 | 21.34B |
| EfficientNet-B7 | \((N,3,600,600)\) | 66,355,448 | 40.31B |
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| EfficientNet-v2-S | \((N,3,224,224)\) | 21,136,440 | 789.91M |
| EfficientNet-v2-M | \((N,3,224,224)\) | 55,302,108 | 1.42B |
| EfficientNet-v2-L | \((N,3,224,224)\) | 120,617,032 | 3.17B |
| EfficientNet-v2-XL | \((N,3,224,224)\) | 210,221,568 | 4.12B |
ResNeSt¶
ConvNet Image Classification
ResNeSt introduces Split Attention Blocks, which divide feature maps into groups, compute attention for each group, and reassemble them to enhance representational power. It extends ResNet by integrating these blocks, achieving improved performance in image recognition tasks with minimal computational overhead.
Zhang, Hang, et al. “ResNeSt: Split-Attention Networks.” arXiv preprint arXiv:2004.08955, 2020.
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| ResNeSt-14 | \((N,3,224,224)\) | 10,611,560 | 2.82B |
| ResNeSt-26 | \((N,3,224,224)\) | 17,069,320 | 3.72B |
| ResNeSt-50 | \((N,3,224,224)\) | 27,483,112 | 5.52B |
| ResNeSt-101 | \((N,3,224,224)\) | 48,274,760 | 10.43B |
| ResNeSt-200 | \((N,3,224,224)\) | 70,201,288 | 17.85B |
| ResNeSt-269 | \((N,3,224,224)\) | 110,929,224 | 22.98B |
| ResNeSt-50-4s2x40d | \((N,3,224,224)\) | 30,417,464 | 5.41B |
| ResNeSt-50-1s4x24d | \((N,3,224,224)\) | 25,676,872 | 5.14B |
ConvNeXt¶
ConvNet Image Classification
ConvNeXt reimagines CNNs using principles inspired by vision transformers, streamlining architectural design while preserving the efficiency of traditional CNNs. It introduces design elements like simplified stem stages, inverted bottlenecks, and expanded kernel sizes to enhance feature extraction.
ConvNeXt
Liu, Zhuang, et al. “A ConvNet for the 2020s.” arXiv preprint arXiv:2201.03545, 2022.
ConvNeXt-v2
Woo, Sanghyun, et al. “ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders.” arXiv preprint arXiv:2301.00808, 2023.
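A ConvNeXt block applies a large-kernel depthwise convolution, LayerNorm, and an inverted-bottleneck MLP over channels, wrapped in a residual connection. A minimal sketch (the `ConvNeXtBlock` class is illustrative; layer scale and stochastic depth are omitted):

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # large-kernel depthwise
        self.norm = nn.LayerNorm(dim)            # applied over the channel dimension
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # inverted bottleneck: expand 4x
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)   # project back

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # (N, C, H, W) -> (N, H, W, C) for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)
        return residual + x

print(ConvNeXtBlock(96)(torch.randn(1, 96, 56, 56)).shape)  # torch.Size([1, 96, 56, 56])
```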
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| ConvNeXt-Tiny | \((N,3,224,224)\) | 28,589,128 | 4.73B |
| ConvNeXt-Small | \((N,3,224,224)\) | 46,884,148 | 8.46B |
| ConvNeXt-Base | \((N,3,224,224)\) | 88,591,464 | 15.93B |
| ConvNeXt-Large | \((N,3,224,224)\) | 197,767,336 | 35.23B |
| ConvNeXt-XLarge | \((N,3,224,224)\) | 350,196,968 | 62.08B |
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| ConvNeXt-v2-Atto | \((N,3,224,224)\) | 3,708,400 | 641.87M |
| ConvNeXt-v2-Femto | \((N,3,224,224)\) | 5,233,240 | 893.05M |
| ConvNeXt-v2-Pico | \((N,3,224,224)\) | 9,066,280 | 1.52B |
| ConvNeXt-v2-Nano | \((N,3,224,224)\) | 15,623,800 | 2.65B |
| ConvNeXt-v2-Tiny | \((N,3,224,224)\) | 28,635,496 | 4.79B |
| ConvNeXt-v2-Base | \((N,3,224,224)\) | 88,717,800 | 16.08B |
| ConvNeXt-v2-Large | \((N,3,224,224)\) | 197,956,840 | 35.64B |
| ConvNeXt-v2-Huge | \((N,3,224,224)\) | 660,289,640 | 120.89B |
InceptionNeXt¶
ConvNet Image Classification
InceptionNeXt combines the Inception idea of parallel branches with the modernized ConvNeXt block design. It decomposes the large-kernel depthwise convolution into parallel branches (a small square kernel, two orthogonal band kernels, and an identity mapping), improving training and inference throughput while preserving accuracy across diverse vision tasks.
Yu, Weihao, et al. “InceptionNeXt: When Inception Meets ConvNeXt.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 5672-5683.
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| InceptionNeXt-Atto | \((N,3,224,224)\) | 4,156,520 | 582.25M |
| InceptionNeXt-Tiny | \((N,3,224,224)\) | 28,083,832 | 4.48B |
| InceptionNeXt-Small | \((N,3,224,224)\) | 49,431,544 | 8.82B |
| InceptionNeXt-Base | \((N,3,224,224)\) | 86,748,840 | 15.47B |
CoAtNet¶
ConvNet Image Classification
CoAtNet extends the hybrid architecture paradigm by integrating convolutional and transformer-based designs. It enhances representation learning through hierarchical feature extraction, leveraging early-stage depthwise convolutions for locality and later-stage self-attention for global context. With relative position encoding, pre-normalization, and an optimized scaling strategy, CoAtNet achieves superior efficiency and performance across various vision tasks.
Dai, Zihang, et al. “CoAtNet: Marrying Convolution and Attention for All Data Sizes.” Advances in Neural Information Processing Systems, 2021, pp. 3965-3977.
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| CoAtNet-0 | \((N,3,224,224)\) | 27,174,944 | 5.52B |
| CoAtNet-1 | \((N,3,224,224)\) | 53,330,240 | 12.32B |
| CoAtNet-2 | \((N,3,224,224)\) | 82,516,096 | 19.72B |
| CoAtNet-3 | \((N,3,224,224)\) | 157,790,656 | 37.17B |
| CoAtNet-4 | \((N,3,224,224)\) | 277,301,632 | 66.79B |
| CoAtNet-5 | \((N,3,224,224)\) | 770,124,608 | 189.34B |
| CoAtNet-6 | \((N,3,224,224)\) | 2,011,558,336 | 293.51B |
| CoAtNet-7 | \((N,3,224,224)\) | 3,107,978,688 | 364.71B |
Visual Transformer (ViT)¶
Transformer Vision Transformer Image Classification
The Vision Transformer (ViT) is a deep learning architecture introduced by Dosovitskiy et al. in 2020, designed for image recognition tasks using self-attention mechanisms. Unlike traditional convolutional neural networks (CNNs), ViT splits an image into fixed-size patches, processes them as a sequence, and applies Transformer layers to capture global dependencies.
Dosovitskiy, Alexey, et al. “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.” International Conference on Learning Representations (ICLR), 2021.
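A minimal sketch of the ViT pipeline: split the image into patches with a strided convolution, prepend a class token, add positional embeddings, and run a Transformer encoder. The `TinyViT` class below is illustrative (dimensions roughly follow ViT-Ti) and uses PyTorch's built-in encoder rather than the original implementation:

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, image_size=224, patch=16, dim=192, depth=12, heads=3, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch) ** 2
        # Patch embedding: a conv with kernel = stride = patch size maps each 16x16 patch to a vector.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True, activation="gelu")
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.patch_embed(x).flatten(2).transpose(1, 2)   # (N, 196, dim) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed      # prepend [CLS], add positions
        x = self.encoder(x)                                   # global self-attention over all patches
        return self.head(x[:, 0])                             # classify from the [CLS] token

print(TinyViT()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```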
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| ViT-Ti | \((N,3,224,224)\) | 5,717,416 | 1.36B |
| ViT-S | \((N,3,224,224)\) | 22,050,664 | 4.81B |
| ViT-B | \((N,3,224,224)\) | 86,567,656 | 17.99B |
| ViT-L | \((N,3,224,224)\) | 304,326,632 | 62.69B |
| ViT-H | \((N,3,224,224)\) | 632,199,400 | 169.45B |
Swin Transformer¶
Transformer Vision Transformer Image Classification
The Swin Transformer is a hierarchical vision transformer introduced by Liu et al. in 2021, designed for image recognition and dense prediction tasks using self-attention mechanisms within shifted local windows. Unlike traditional convolutional neural networks (CNNs) and the original Vision Transformer (ViT)—which splits an image into fixed-size patches and processes them as a flat sequence—the Swin Transformer divides the image into non-overlapping local windows and computes self-attention within each window.
Swin Transformer
Liu, Ze, et al. “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows.” arXiv preprint arXiv:2103.14030, 2021.
Swin Transformer-v2
Liu, Ze, et al. “Swin Transformer V2: Scaling Up Capacity and Resolution.” arXiv preprint arXiv:2111.09883, 2021.
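The key operation is partitioning the feature map into non-overlapping windows so self-attention is computed within each window; shifting the window grid between consecutive blocks lets information flow across window boundaries. A minimal sketch of just the partition/shift bookkeeping (the helper functions are illustrative; attention itself is omitted):

```python
import torch

def window_partition(x: torch.Tensor, window: int) -> torch.Tensor:
    """(N, H, W, C) -> (num_windows * N, window*window, C) non-overlapping windows."""
    n, h, w, c = x.shape
    x = x.view(n, h // window, window, w // window, window, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, c)

def window_reverse(windows: torch.Tensor, window: int, h: int, w: int) -> torch.Tensor:
    """Inverse of window_partition."""
    n = windows.shape[0] // ((h // window) * (w // window))
    x = windows.view(n, h // window, w // window, window, window, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(n, h, w, -1)

feat = torch.randn(1, 56, 56, 96)                           # stage-1 feature map of Swin-T
shifted = torch.roll(feat, shifts=(-3, -3), dims=(1, 2))    # the "shifted window" variant
wins = window_partition(shifted, window=7)
print(wins.shape)                                           # torch.Size([64, 49, 96]): 8x8 windows of 7x7 tokens
restored = torch.roll(window_reverse(wins, 7, 56, 56), shifts=(3, 3), dims=(1, 2))
print(torch.allclose(restored, feat))                       # True
```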
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| Swin-T | \((N,3,224,224)\) | 28,288,354 | 4.95B |
| Swin-S | \((N,3,224,224)\) | 49,606,258 | 9.37B |
| Swin-B | \((N,3,224,224)\) | 87,768,224 | 16.35B |
| Swin-L | \((N,3,224,224)\) | 196,532,476 | 36.08B |
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| Swin-v2-T | \((N,3,224,224)\) | 28,349,842 | 5.01B |
| Swin-v2-S | \((N,3,224,224)\) | 49,731,106 | 9.48B |
| Swin-v2-B | \((N,3,224,224)\) | 87,922,400 | 16.49B |
| Swin-v2-L | \((N,3,224,224)\) | 196,745,308 | 36.29B |
| Swin-v2-H | \((N,3,224,224)\) | 657,796,668 | 119.42B |
| Swin-v2-G | \((N,3,224,224)\) | 3,000,869,564 | 531.67B |
Convolutional Transformer (CvT)¶
Transformer Vision Transformer Image Classification
CvT (Convolutional Vision Transformer) combines self-attention with depthwise convolutions to improve local feature extraction and computational efficiency. This hybrid design retains the global modeling capabilities of Vision Transformers while enhancing inductive biases, making it effective for image classification and dense prediction tasks.
Wu, Haiping, et al. “CvT: Introducing Convolutions to Vision Transformers.” Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 22-31.
Pyramid Vision Transformer (PVT)¶
Transformer Vision Transformer Image Classification
The Pyramid Vision Transformer (PVT) combines CNN-like pyramidal structures with Transformer attention, capturing multi-scale features efficiently. It reduces spatial resolution progressively and uses spatial-reduction attention (SRA) to enhance performance in dense prediction tasks like detection and segmentation.
PVT
Wang, Wenhai, et al. “Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions.” arXiv preprint arXiv:2102.12122, 2021.
PVT-v2
Wang, Wenhai, et al. “PVT v2: Improved Baselines with Pyramid Vision Transformer.” Computational Visual Media, vol. 8, no. 3, 2022, pp. 415-424.
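Spatial-reduction attention keeps full-resolution queries but downsamples the keys and values with a strided convolution before attention, cutting the quadratic cost on large feature maps. A simplified sketch built on PyTorch's `nn.MultiheadAttention` (the `SpatialReductionAttention` class is illustrative):

```python
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    def __init__(self, dim: int, heads: int, sr_ratio: int):
        super().__init__()
        # Keys/values are computed on a spatially reduced copy of the feature map.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (N, H*W, C) token sequence for an h x w feature map
        n, _, c = x.shape
        kv = x.transpose(1, 2).reshape(n, c, h, w)
        kv = self.sr(kv).flatten(2).transpose(1, 2)    # (N, (H/r)*(W/r), C) reduced tokens
        kv = self.norm(kv)
        out, _ = self.attn(query=x, key=kv, value=kv)  # queries stay at full resolution
        return out

sra = SpatialReductionAttention(dim=64, heads=1, sr_ratio=8)
tokens = torch.randn(1, 56 * 56, 64)                   # stage-1 tokens of PVT
print(sra(tokens, 56, 56).shape)                       # torch.Size([1, 3136, 64])
```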
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| PVT-Tiny | \((N,3,224,224)\) | 12,457,192 | 2.02B |
| PVT-Small | \((N,3,224,224)\) | 23,003,048 | 3.93B |
| PVT-Medium | \((N,3,224,224)\) | 41,492,648 | 6.66B |
| PVT-Large | \((N,3,224,224)\) | 55,359,848 | 8.71B |
| PVT-Huge | \((N,3,224,224)\) | 286,706,920 | 48.63B |
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| PVT-v2-B0 | \((N,3,224,224)\) | 3,666,760 | 677.67M |
| PVT-v2-B1 | \((N,3,224,224)\) | 14,009,000 | 2.32B |
| PVT-v2-B2 | \((N,3,224,224)\) | 25,362,856 | 4.39B |
| PVT-v2-B2-Linear | \((N,3,224,224)\) | 22,553,512 | 4.27B |
| PVT-v2-B3 | \((N,3,224,224)\) | 45,238,696 | 7.39B |
| PVT-v2-B4 | \((N,3,224,224)\) | 62,556,072 | 10.80B |
| PVT-v2-B5 | \((N,3,224,224)\) | 82,882,984 | 13.47B |
CrossViT¶
Transformer Vision Transformer Image Classification
CrossViT is a vision transformer architecture that combines multi-scale tokenization by processing input images at different resolutions in parallel, enabling it to capture both fine-grained and coarse-grained visual features. It uses a novel cross-attention mechanism to fuse information across these scales, improving performance on image recognition tasks.
Chen, Chun-Fu, Quanfu Fan, and Rameswar Panda. “CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification.” arXiv preprint arXiv:2103.14899, 2021.
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| CrossViT-Ti | \((N,3,224,224)\) | 7,014,800 | 1.73B |
| CrossViT-S | \((N,3,224,224)\) | 26,856,272 | 5.94B |
| CrossViT-B | \((N,3,224,224)\) | 105,025,232 | 21.85B |
| CrossViT-9 | \((N,3,224,224)\) | 8,553,296 | 2.01B |
| CrossViT-15 | \((N,3,224,224)\) | 27,528,464 | 6.13B |
| CrossViT-18 | \((N,3,224,224)\) | 43,271,408 | 9.48B |
| CrossViT-9† | \((N,3,224,224)\) | 8,776,592 | 2.15B |
| CrossViT-15† | \((N,3,224,224)\) | 28,209,008 | 6.45B |
| CrossViT-18† | \((N,3,224,224)\) | 44,266,976 | 9.93B |
MaxViT¶
Transformer Vision Transformer Image Classification
MaxViT is a hybrid vision architecture that combines convolution, windowed attention, and grid-based attention in a multi-axis design. This hierarchical structure enables MaxViT to efficiently capture both local and global dependencies, making it effective for various vision tasks with high accuracy and scalability.
Tu, Zhengzhong, et al. “MaxViT: Multi-Axis Vision Transformer.” arXiv preprint arXiv:2204.01697, 2022.
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| MaxViT-T | \((N,3,224,224)\) | 25,081,416 | 5.60B |
| MaxViT-S | \((N,3,224,224)\) | 55,757,304 | 10.59B |
| MaxViT-B | \((N,3,224,224)\) | 96,626,776 | 21.83B |
| MaxViT-L | \((N,3,224,224)\) | 171,187,880 | 38.51B |
| MaxViT-XL | \((N,3,224,224)\) | 383,734,024 | 83.74B |
To be implemented…🔮