Mask R-CNN

ConvNet Two-Stage Detector

class lucid.models.mask_rcnn.MaskRCNN(config: MaskRCNNConfig)

MaskRCNN extends Faster R-CNN by adding a parallel instance mask head on top of RoI features. For each foreground proposal, it predicts class logits, bounding-box deltas, and a per-instance binary segmentation mask.

%%{init: {"flowchart":{"curve":"monotoneX","nodeSpacing":50,"rankSpacing":50}} }%%
flowchart LR
  linkStyle default stroke-width:2.0px
  subgraph sg_m0["<span style='font-size:20px;font-weight:700'>mask_rcnn_resnet_50_fpn</span>"]
  style sg_m0 fill:#000000,fill-opacity:0.05,stroke:#000000,stroke-opacity:0.75,stroke-width:1px
    subgraph sg_m1["_ResNetFPNBackbone"]
    style sg_m1 fill:#000000,fill-opacity:0.05,stroke:#000000,stroke-opacity:0.75,stroke-width:1px
      m2["resnet_50"];
      m3["Sequential"];
      m4["MaxPool2d"];
      m5["FPN"];
    end
    m9["_AnchorGenerator"];
    subgraph sg_m7["_RegionProposalNetwork"]
    style sg_m7 fill:#000000,fill-opacity:0.05,stroke:#000000,stroke-opacity:0.75,stroke-width:1px
      m8["_RPNHead"];
      m9["_AnchorGenerator"];
    end
    subgraph sg_m10["MultiScaleROIAlign x 2"]
    style sg_m10 fill:#000000,fill-opacity:0.05,stroke:#000000,stroke-opacity:0.75,stroke-width:1px
      m10_in(["Input"]);
      m10_out(["Output"]);
  style m10_in fill:#e2e8f0,stroke:#64748b,stroke-width:1px;
  style m10_out fill:#e2e8f0,stroke:#64748b,stroke-width:1px;
      m11["ROIAlign"];
    end
    m12(["Linear x 2"]);
    m13(["Dropout x 2"]);
    m14(["Linear x 2"]);
    subgraph sg_m15["_MaskHead"]
    style sg_m15 fill:#000000,fill-opacity:0.05,stroke:#000000,stroke-opacity:0.75,stroke-width:1px
      m16(["Conv2d x 4"]);
      m17["ConvTranspose2d"];
      m18["Conv2d"];
    end
    m19["ROIAlign"];
  end
  input["Input"];
  output["Output"];
  style input fill:#fff3cd,stroke:#a67c00,stroke-width:1px;
  style output fill:#fff3cd,stroke:#a67c00,stroke-width:1px;
  style m4 fill:#fefcbf,stroke:#b7791f,stroke-width:1px;
  style m12 fill:#ebf8ff,stroke:#2b6cb0,stroke-width:1px;
  style m13 fill:#edf2f7,stroke:#4a5568,stroke-width:1px;
  style m14 fill:#ebf8ff,stroke:#2b6cb0,stroke-width:1px;
  style m16 fill:#ffe8e8,stroke:#c53030,stroke-width:1px;
  style m17 fill:#ffe8e8,stroke:#c53030,stroke-width:1px;
  style m18 fill:#ffe8e8,stroke:#c53030,stroke-width:1px;
  input --> m3;
  m10_in -.-> m11;
  m10_out --> m16;
  m11 --> m10_out;
  m11 -.-> m12;
  m12 --> m13;
  m13 -.-> m12;
  m13 --> m14;
  m14 --> m10_in;
  m16 --> m17;
  m17 --> m18;
  m18 --> output;
  m2 --> m5;
  m3 --> m4;
  m4 --> m2;
  m5 --> m8;
  m8 -.-> m11;
    

Class Signature

class MaskRCNN(nn.Module):
    def __init__(self, config: MaskRCNNConfig)

Parameters

  • config (MaskRCNNConfig): Configuration object that packages the backbone, proposal settings, RoI head dimensions, and mask head dimensions used to build the detector.

Configuration

  • backbone (nn.Module): Backbone feature extractor used by both proposal and RoI heads.

  • feat_channels (int): Number of channels in the feature map consumed by detection and mask heads.

  • num_classes (int): Number of target classes, excluding background.

  • use_fpn (bool): If True, uses MultiScaleROIAlign for FPN multi-level features.

  • anchor_sizes (tuple[int, …]): Anchor scales used by the RPN.

  • aspect_ratios (tuple[float, …]): Anchor aspect ratios used by the RPN.

  • anchor_stride (int): Anchor stride on the feature map.

  • pool_size (tuple[int, int]): RoI pooling resolution for classification and box regression.

  • hidden_dim (int): Hidden dimension of the two-layer MLP detection head.

  • dropout (float): Dropout probability for the detection head.

  • mask_pool_size (tuple[int, int]): RoI pooling size for mask features.

  • mask_hidden_channels (int): Hidden channels in the mask head convolution stack.

  • mask_out_size (int): Final pooled target size for masks. The current implementation requires this to equal 2 * mask_pool_size[0].
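The size constraint on mask_out_size can be checked before a config is built. The helper below is a minimal standalone sketch, not part of lucid: the name check_mask_sizes is hypothetical, and the factor of 2 is taken directly from the note above. The example values are illustrative and are not claimed to be the library defaults.

```python
from __future__ import annotations


def check_mask_sizes(mask_pool_size: tuple[int, int], mask_out_size: int) -> int:
    """Return the expected mask_out_size, or raise if the pair is inconsistent.

    Hypothetical helper that mirrors the documented rule
    mask_out_size == 2 * mask_pool_size[0]; it is not a lucid API.
    """
    expected = 2 * mask_pool_size[0]
    if mask_out_size != expected:
        raise ValueError(
            f"mask_out_size={mask_out_size}, but 2 * mask_pool_size[0]={expected}"
        )
    return expected


# Illustrative values only: a 14x14 pooled mask feature paired with 28x28 masks.
print(check_mask_sizes((14, 14), 28))
```

Running a check like this up front can surface a mismatched pair as a readable error rather than as a shape mismatch later on.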

Architecture

  1. Backbone + RPN:

    • Extracts shared image features.

    • Proposes candidate boxes with objectness and box regression.

  2. RoI Detection Head:

    • Applies RoIAlign on proposals.

    • Predicts class logits and class-specific box deltas.

  3. RoI Mask Head:

    • Applies dedicated RoIAlign for mask features.

    • Predicts class-specific mask logits per proposal.

  4. Training Losses:

    • Combines the RPN objectness and box-regression losses, the RoI classification and box-regression losses, and a binary cross-entropy (BCE) mask loss.
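The class-specific mask prediction and BCE mask loss described above can be illustrated in plain Python. This is a sketch under stated assumptions, not lucid's implementation: the function names are hypothetical, the mask channel is assumed to be indexed directly by the ground-truth class, and the loss is assumed to be averaged over pixels. The library may offset labels for background or reduce the loss differently.

```python
import math


def bce_with_logits(logit: float, target: float) -> float:
    """Binary cross-entropy on a raw logit, in a numerically stable form."""
    # Equivalent to -[t * log(sigmoid(x)) + (1 - t) * log(1 - sigmoid(x))].
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def mask_loss_for_roi(mask_logits, gt_class, gt_mask):
    """Mean BCE for one RoI, using only the ground-truth class channel.

    mask_logits: nested lists shaped [num_classes][H][W] (raw logits).
    gt_class:    index of the channel to supervise (assumed direct indexing).
    gt_mask:     nested lists shaped [H][W] holding 0 or 1.
    """
    selected = mask_logits[gt_class]  # the other class channels are ignored
    total, count = 0.0, 0
    for row_logits, row_targets in zip(selected, gt_mask):
        for logit, target in zip(row_logits, row_targets):
            total += bce_with_logits(logit, float(target))
            count += 1
    return total / count


# Two classes, 2x2 masks. Only channel 0 contributes for gt_class=0.
logits = [[[0.0, 0.0], [0.0, 0.0]], [[5.0, 5.0], [5.0, 5.0]]]
target = [[1, 0], [0, 1]]
print(mask_loss_for_roi(logits, 0, target))
```

Supervising only the ground-truth channel means the mask branch does not have to compete across classes; the classification branch decides which channel is read out at inference time.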

Loss Dictionary

class _MaskRCNNLoss(TypedDict):
    rpn_cls_loss: Tensor
    rpn_reg_loss: Tensor
    roi_cls_loss: Tensor
    roi_reg_loss: Tensor
    mask_loss: Tensor
    total_loss: Tensor
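When logging training progress, it can help to print every entry of the loss dictionary rather than only total_loss. The sketch below uses plain floats as stand-ins for tensors so that it runs without lucid. LossValues and format_losses are hypothetical names, and the numbers are placeholders rather than real training output. How the library weights the individual terms into total_loss is not specified in this section.

```python
from typing import TypedDict


class LossValues(TypedDict):
    """Float stand-in for _MaskRCNNLoss (hypothetical, for illustration)."""

    rpn_cls_loss: float
    rpn_reg_loss: float
    roi_cls_loss: float
    roi_reg_loss: float
    mask_loss: float
    total_loss: float


def format_losses(losses: LossValues) -> str:
    """Render the total first, followed by each component in key order."""
    parts = [
        f"{name}={value:.4f}"
        for name, value in losses.items()
        if name != "total_loss"
    ]
    return f"total={losses['total_loss']:.4f} (" + ", ".join(parts) + ")"


# Placeholder values only.
example: LossValues = {
    "rpn_cls_loss": 0.1,
    "rpn_reg_loss": 0.2,
    "roi_cls_loss": 0.3,
    "roi_reg_loss": 0.4,
    "mask_loss": 0.5,
    "total_loss": 1.5,
}
print(format_losses(example))
```

With real output from get_loss, each tensor would first need to be converted to a Python scalar.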

Methods

MaskRCNN.forward(images: Tensor, rois: Tensor | None = None, roi_idx: Tensor | None = None) -> tuple[Tensor, Tensor, Tensor]
MaskRCNN.predict(images: Tensor, *, score_thresh: float = 0.05, nms_thresh: float = 0.5, top_k: int = 100) -> list[dict[str, Tensor]]
MaskRCNN.get_loss(images: Tensor, targets: list[dict[str, Tensor]]) -> _MaskRCNNLoss
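The three keyword arguments of predict correspond to the usual detection post-processing steps: score filtering, non-maximum suppression (NMS), and a cap on the number of detections. The following pure-Python sketch shows one common way these steps interact. It is not lucid's implementation: iou and filter_detections are hypothetical names, and details such as per-class NMS or whether the comparisons are strict are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_detections(boxes, scores, score_thresh=0.05, nms_thresh=0.5, top_k=100):
    """Return indices of kept boxes, ordered from highest to lowest score."""
    # Step 1: drop low-confidence boxes, then visit the rest by descending score.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thresh),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Step 3: stop once top_k detections have been collected.
        if len(keep) >= top_k:
            break
        # Step 2: greedy NMS, keeping a box only if it does not overlap
        # any already-kept box by more than nms_thresh.
        if all(iou(boxes[i], boxes[j]) <= nms_thresh for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.01]
print(filter_detections(boxes, scores))  # [0, 2]
```

In this example, box 1 overlaps box 0 with an IoU of about 0.68 and is suppressed, and box 3 falls below the score threshold.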

Examples

Basic Usage

import lucid
import lucid.nn as nn
from lucid.models.vision.mask_rcnn import MaskRCNN, MaskRCNNConfig


class SimpleBackbone(nn.Module):
    """Toy backbone: two stride-2 convolutions mapping 3 -> 128 channels."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)


# feat_channels matches the 128 output channels of SimpleBackbone.
model = MaskRCNN(
    MaskRCNNConfig(backbone=SimpleBackbone(), feat_channels=128, num_classes=5)
)

# A batch containing one random 512x512 RGB image.
x = lucid.random.randn(1, 3, 512, 512)

# forward returns class logits, box deltas, and mask logits.
cls_logits, bbox_deltas, mask_logits = model(x)
print(cls_logits.shape, bbox_deltas.shape, mask_logits.shape)

Inference API

detections = model.predict(x)
print(detections[0]["boxes"].shape)
print(detections[0]["scores"].shape)
print(detections[0]["labels"].shape)
print(detections[0]["masks"].shape)