Object Detection¶

R-CNN¶

ConvNet Two-Stage Detector Object Detection

R-CNN (Region-based CNN) detects objects by first generating region proposals with Selective Search, then warping each proposal to a fixed size and classifying it from features extracted by a shared CNN. Overlapping detections are pruned with Non-Maximum Suppression (NMS).
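The Non-Maximum Suppression step can be illustrated with a minimal pure-Python sketch. The function names `iou` and `nms`, the `(x1, y1, x2, y2)` box format, and the 0.5 threshold are assumptions made for this example and are not taken from this library's API.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap an already-kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two near-duplicate boxes and one distant box (hypothetical values).
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Here the second box overlaps the first with an IoU of roughly 0.82, so it is suppressed and only the first and third boxes survive.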

Girshick, Ross, et al. “Rich feature hierarchies for accurate object detection and semantic segmentation.” Proceedings of the IEEE conference on computer vision and pattern recognition (2014): 580-587.

Name	Model	Input Shape
R-CNN	RCNN	\((N,C_{in},H,W)\)

Fast R-CNN¶

ConvNet Two-Stage Detector Object Detection

Fast R-CNN improves upon R-CNN by computing the feature map once for the entire image, then pooling features from proposed regions using RoI Pooling. It unifies classification and bounding box regression into a single network with a shared backbone.
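To make RoI Pooling concrete, the sketch below max-pools one region of a single-channel feature map into a fixed 2×2 grid. It is a simplified illustration: the function name `roi_max_pool`, the integer RoI coordinates with an exclusive end, and the floor/ceil bin boundaries are assumptions for this example, not this library's implementation.

```python
import math


def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool roi=(x1, y1, x2, y2), end-exclusive, into an out_h x out_w grid."""
    x1, y1, x2, y2 = roi
    out_h, out_w = output_size
    roi_h, roi_w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        # floor/ceil boundaries guarantee every bin covers at least one row.
        r0 = y1 + math.floor(i * roi_h / out_h)
        r1 = y1 + math.ceil((i + 1) * roi_h / out_h)
        row = []
        for j in range(out_w):
            c0 = x1 + math.floor(j * roi_w / out_w)
            c1 = x1 + math.ceil((j + 1) * roi_w / out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# A 6x6 feature map whose value at (r, c) is 6*r + c (hypothetical data).
feature_map = [[r * 6 + c for c in range(6)] for r in range(6)]
pooled = roi_max_pool(feature_map, roi=(1, 1, 5, 4))
```

Whatever the size of the proposed region, the output has the same fixed shape, which is what lets a single fully connected head process every proposal.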

Girshick, Ross. “Fast R-CNN.” Proceedings of the IEEE international conference on computer vision (2015): 1440-1448.

Name	Model	Input Shape
Fast R-CNN	FastRCNN	\((N,C_{in},H,W)\)

Faster R-CNN¶

ConvNet Two-Stage Detector Object Detection

Faster R-CNN builds on Fast R-CNN by introducing a Region Proposal Network (RPN) that shares convolutional features with the detection head. This replaces the external proposal step, enabling end-to-end training and near real-time inference.
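The RPN scores a set of reference boxes (anchors) at every feature-map location. The following sketch generates such anchors. The stride of 16 and the three scales and three aspect ratios follow the defaults described in the Faster R-CNN paper, while the function name `make_anchors` and the cell-centre convention are assumptions for illustration.

```python
import math


def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors for every feature-map cell."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Centre of this cell in input-image pixels.
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    # r is height / width; the anchor area stays at s * s.
                    w = s / math.sqrt(r)
                    h = s * math.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(feat_h=2, feat_w=3)
```

With three scales and three ratios, each location contributes nine anchors, so a 2×3 feature map yields 54 anchors in total.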

Ren, Shaoqing et al. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.” IEEE Transactions on Pattern Analysis and Machine Intelligence (2017).

Name	Model	Input Shape	Parameter Count
Faster R-CNN	FasterRCNN	\((N,C_{in},H,W)\)	\(-\)
Faster R-CNN ResNet-50 FPN	faster_rcnn_resnet_50_fpn	\((N,3,H,W)\)	43,515,902
Faster R-CNN ResNet-101 FPN	faster_rcnn_resnet_101_fpn	\((N,3,H,W)\)	62,508,030

YOLO¶

ConvNet One-Stage Detector Object Detection

YOLO is a one-stage object detector that frames detection as a single regression problem, directly predicting bounding boxes and class probabilities from the full image in a single forward pass. Because it needs no separate proposal stage, it can run in real time.
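As an illustration of the regression formulation, the sketch below decodes one grid-cell prediction into pixel coordinates using the YOLO-v1 parameterization (a 7×7 grid, 2 boxes per cell, 20 classes, box centre relative to the cell, box size relative to the image). The function name `decode_cell` and the sample prediction are hypothetical, and later YOLO versions use anchor-based encodings that differ from this.

```python
def decode_cell(pred, row, col, grid_size=7, image_size=448):
    """Convert one (x, y, w, h, confidence) prediction to pixel-space corners."""
    x, y, w, h, conf = pred
    cell = image_size / grid_size          # size of one grid cell in pixels
    cx = (col + x) * cell                  # x, y are offsets inside the cell
    cy = (row + y) * cell
    bw, bh = w * image_size, h * image_size  # w, h are fractions of the image
    return (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2, conf)


S, B, C = 7, 2, 20
output_depth = B * 5 + C  # values predicted per grid cell

# A box centred in cell (3, 3), a quarter of the image wide and half as tall.
box = decode_cell((0.5, 0.5, 0.25, 0.5, 0.9), row=3, col=3)
```

Under these assumptions the network output for one image is an S × S × `output_depth` tensor, and every box is read directly from it without a proposal step.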

YOLO-v1¶

Redmon, Joseph et al. “You Only Look Once: Unified, Real-Time Object Detection.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016).

Name	Model	Input Shape	Parameter Count	FLOPs
YOLO-v1	yolo_v1	\((N,3,448,448)\)	271,716,734	404.84M
YOLO-v1-Tiny	yolo_v1_tiny	\((N,3,448,448)\)	236,720,462	302.21M

YOLO-v2¶

Redmon, Joseph, and Ali Farhadi. “YOLO9000: Better, Faster, Stronger.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 7263-7271.

Name	Model	Input Shape	Parameter Count	FLOPs
YOLO-v2	yolo_v2	\((N,3,416,416)\)	21,287,133	214.26M
YOLO-v2-Tiny	yolo_v2_tiny	\((N,3,416,416)\)	15,863,821	77.45M

YOLO-v3¶

Redmon, Joseph, and Ali Farhadi. “YOLOv3: An Incremental Improvement.” arXiv preprint arXiv:1804.02767 (2018).

Name	Model	Input Shape	Parameter Count	FLOPs
YOLO-v3	yolo_v3	\((N,3,416,416)\)	62,974,149	558.71M
YOLO-v3-Tiny	yolo_v3_tiny	\((N,3,416,416)\)	23,106,933	147.93M

YOLO-v4¶

Bochkovskiy, Alexey, Chien-Yao Wang, and Hong-Yuan Mark Liao. “YOLOv4: Optimal Speed and Accuracy of Object Detection.” arXiv preprint arXiv:2004.10934 (2020).

Name	Model	Input Shape	Parameter Count	FLOPs
YOLO-v4	yolo_v4	\((N,3,608,608)\)	93,488,078	1.41B

EfficientDet¶

ConvNet One-Stage Detector Object Detection

EfficientDet is a family of object detectors that pair an EfficientNet backbone with a BiFPN for multi-scale feature fusion. Compound scaling jointly grows the backbone, the BiFPN, the prediction heads, and the input resolution across models D0-D7, so each size balances accuracy against cost while using fewer parameters than earlier detectors of comparable accuracy.
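Each BiFPN node combines its inputs with learned, normalized weights, which the EfficientDet paper calls fast normalized fusion. The sketch below shows that weighting on flat lists of numbers. The function name and the sample values are assumptions for illustration, and a real BiFPN node also resizes its inputs and applies a convolution afterwards.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-length feature vectors with ReLU-clamped, normalized weights."""
    w = [max(0.0, wi) for wi in weights]  # ReLU keeps every weight non-negative
    total = sum(w) + eps                  # eps avoids division by zero
    fused = [0.0] * len(features[0])
    for wi, feat in zip(w, features):
        for k, v in enumerate(feat):
            fused[k] += (wi / total) * v
    return fused


# Two feature vectors fused with weights 1 and 3 (hypothetical values).
fused = fast_normalized_fusion([[1.0, 2.0], [3.0, 4.0]], weights=[1.0, 3.0])
```

The normalized weights sum to just under 1, so the result is close to a weighted average of the inputs, here about 2.5 and 3.5.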

Tan, Mingxing, Ruoming Pang, and Quoc V. Le. “EfficientDet: Scalable and Efficient Object Detection.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020).

Name	Model	Input Shape	Parameter Count	FLOPs
EfficientDet-D0	efficientdet_d0	\((N,3,512,512)\)	3,591,656	2.50B
EfficientDet-D1	efficientdet_d1	\((N,3,640,640)\)	5,068,752	6.10B
EfficientDet-D2	efficientdet_d2	\((N,3,768,768)\)	6,457,434	11.00B
EfficientDet-D3	efficientdet_d3	\((N,3,896,896)\)	10,286,134	24.90B
EfficientDet-D4	efficientdet_d4	\((N,3,1024,1024)\)	18,740,232	55.20B
EfficientDet-D5	efficientdet_d5	\((N,3,1280,1280)\)	29,882,556	136.00B
EfficientDet-D6	efficientdet_d6	\((N,3,1280,1280)\)	52,634,622	222.00B
EfficientDet-D7	efficientdet_d7	\((N,3,1536,1536)\)	87,173,148	325.00B

DETR¶

Transformer Detection Transformer Object Detection

DETR (DEtection TRansformer) is a fully end-to-end object detector that replaces hand-crafted components like anchor generation and NMS with a Transformer architecture. It directly predicts a fixed-size set of detections and is trained with a set-based global loss that uses bipartite matching to pair each prediction with at most one ground-truth object, treating detection as a set prediction problem.
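The bipartite matching step can be sketched on a tiny cost matrix. DETR uses the Hungarian algorithm for this. The example below substitutes a brute-force search over permutations, which gives the same optimal assignment but is only practical for a handful of boxes. The function name `match` and the cost values are hypothetical.

```python
from itertools import permutations


def match(cost):
    """Assign one distinct prediction to each ground-truth object at minimum total cost.

    cost[i][j] is the cost of pairing prediction i with ground truth j.
    Assumes there are at least as many predictions as ground-truth objects.
    """
    n_pred, n_gt = len(cost), len(cost[0])
    best, best_total = None, float("inf")
    # Brute-force stand-in for the Hungarian algorithm (exponential time).
    for perm in permutations(range(n_pred), n_gt):
        total = sum(cost[p][j] for j, p in enumerate(perm))
        if total < best_total:
            best, best_total = perm, total
    return list(best), best_total


# Three predictions, two ground-truth objects (hypothetical costs).
cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.5, 0.6],
]
assignment, total = match(cost)
```

In this example `assignment[j]` is the prediction matched to ground truth `j`. The prediction left unmatched would be trained toward the "no object" class, which is how the set-based loss removes the need for NMS.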

Carion, Nicolas, et al. “End-to-End Object Detection with Transformers.” European Conference on Computer Vision (2020).

Name	Model	Input Shape	Parameter Count	FLOPs
DETR-R50	detr_r50	\((N,3,800,\le 1312)\)	41,578,400	88.13B
DETR-R101	detr_r101	\((N,3,800,\le 1312)\)	60,570,528	167.21B

To be implemented…🔮