Image Segmentation Models

FCN

ConvNet Segmentation ConvNet

FCN adapts convolutional backbones for dense semantic segmentation by replacing image-level classification heads with per-pixel prediction layers. In this implementation, a ResNet backbone produces dense features, a lightweight head predicts segmentation logits, and an optional auxiliary head can be used during training.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. “Fully convolutional networks for semantic segmentation.” Proceedings of the IEEE conference on computer vision and pattern recognition (2015): 3431-3440.

Name

Model

Input Shape

Parameter Count

Pre-Trained

FCN-ResNet-50

fcn_resnet_50

\((N,3,H,W)\)

35,322,218

FCN-ResNet-101

fcn_resnet_101

\((N,3,H,W)\)

54,314,346

U-Net

ConvNet Segmentation ConvNet

U-Net is an encoder-decoder architecture for dense prediction that combines multi-scale feature extraction with skip connections between matching encoder and decoder stages. This implementation is configurable and can express a wide range of 2D and 3D segmentation variants by changing stage depth, channel widths, normalization, activation, skip merging, and sampling strategy.

Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. “U-Net: Convolutional Networks for Biomedical Image Segmentation.” Medical Image Computing and Computer-Assisted Intervention (2015): 234-241.

Name

Model

Input Shape

Parameter Count

U-Net 2D

UNet2d

\((N,C,H,W)\)

Variable

U-Net 3D

UNet3d

\((N,C,D,H,W)\)

Variable

ResUNet

ConvNet Segmentation ConvNet

ResUNet is an encoder-decoder architecture for dense prediction that integrates residual learning units with multi-scale feature extraction and skip connections. By replacing standard convolutional blocks with shortcut connections, it facilitates deeper network training and mitigates gradient vanishing.

Zhang, Zhengxin, Qingjie Liu, and Yunhong Wang. “Road extraction by deep residual u-net.” IEEE Geoscience and Remote Sensing Letters 15.5 (2018): 749-753.

Name

Model

Input Shape

Parameter Count

ResUNet 2D

ResUNet2d

\((N,C,H,W)\)

Variable

ResUNet 3D

ResUNet3d

\((N,C,D,H,W)\)

Variable

Attention U-Net

ConvNet Segmentation ConvNet

Attention U-Net extends the encoder-decoder structure of U-Net with additive attention gates on skip connections. Decoder-side gating signals suppress irrelevant encoder responses before concatenation, improving localization while preserving the familiar multi-scale segmentation pipeline.

Oktay, Ozan, et al. “Attention U-Net: Learning Where to Look for the Pancreas.” arXiv preprint arXiv:1804.03999 (2018).

Name

Model

Input Shape

Parameter Count

Attention U-Net 2D

AttentionUNet2d

\((N,C,H,W)\)

Variable

Attention U-Net 3D

AttentionUNet3d

\((N,C,D,H,W)\)

Variable

Mask R-CNN

ConvNet Two-Stage Detector

Mask R-CNN extends Faster R-CNN with a parallel mask prediction branch to perform instance-level segmentation in addition to bounding box detection and classification. The model first proposes candidate object regions, then predicts class labels, box refinements, and a binary mask for each detected instance.

He, Kaiming, et al. “Mask R-CNN.” Proceedings of the IEEE International Conference on Computer Vision (2017): 2961-2969.

Name

Model

Input Shape

Parameter Count

Pre-Trained

Mask R-CNN ResNet-50 FPN

mask_rcnn_resnet_50_fpn

\((N,3,H,W)\)

46,037,607

Mask R-CNN ResNet-101 FPN

mask_rcnn_resnet_101_fpn

\((N,3,H,W)\)

65,126,611

MaskFormer

Transformer Segmentation Transformer

MaskFormer reformulates segmentation as mask classification with Transformer queries. A CNN backbone and pixel decoder produce dense features, and each query predicts a class and a mask embedding that is projected to segmentation logits.

Cheng, Bowen, et al. “Per-Pixel Classification is Not All You Need for Semantic Segmentation.” Advances in Neural Information Processing Systems 34 (2021): 17864-17875.

Name

Model

Input Shape

Parameter Count

Pre-Trained

MaskFormer-ResNet-18

maskformer_resnet_18

\((N,3,H,W)\)

24,700,119

MaskFormer-ResNet-34

maskformer_resnet_34

\((N,3,H,W)\)

34,808,279

MaskFormer-ResNet-50

maskformer_resnet_50

\((N,3,H,W)\)

41,307,863

MaskFormer-ResNet-101

maskformer_resnet_101

\((N,3,H,W)\)

60,299,991

Mask2Former

Transformer Segmentation Transformer

Mask2Former introduces masked attention and multi-scale decoding for universal segmentation. Query masks from each decoder stage are reused as attention constraints for the next stage.

Cheng, Bowen, et al. “Masked-attention Mask Transformer for Universal Image Segmentation.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022): 1290-1299.

Name

Model

Input Shape

Parameter Count

Pre-Trained

Mask2Former-ResNet-18

mask2former_resnet_18

\((N,3,H,W)\)

30,972,567

Mask2Former-ResNet-34

mask2former_resnet_34

\((N,3,H,W)\)

41,080,727

Mask2Former-ResNet-50

mask2former_resnet_50

\((N,3,H,W)\)

44,041,367

Mask2Former-ResNet-101

mask2former_resnet_101

\((N,3,H,W)\)

63,033,495

Mask2Former-Swin-Tiny

mask2former_swin_tiny

\((N,3,224,224)\)

47,439,633

Mask2Former-Swin-Small

mask2former_swin_small

\((N,3,224,224)\)

68,757,537

Mask2Former-Swin-Base

mask2former_swin_base

\((N,3,384,384)\)

106,922,191

Mask2Former-Swin-Large

mask2former_swin_large

\((N,3,384,384)\)

215,488,779