Faster R-CNN

ConvNet · Two-Stage Detector · Object Detection

class lucid.models.FasterRCNN(backbone: Module, feat_channels: int, num_classes: int, *, use_fpn: bool = False, anchor_sizes: tuple[int, ...] = (128, 256, 512), aspect_ratios: tuple[float, ...] = (0.5, 1.0, 2.0), anchor_stride: int = 16, pool_size: tuple[int, int] = (7, 7), hidden_dim: int = 4096, dropout: float = 0.5)

FasterRCNN implements the Faster Region-based Convolutional Neural Network, an improvement over Fast R-CNN that introduces a learnable Region Proposal Network (RPN). This architecture eliminates the need for external proposal methods such as Selective Search by jointly learning region proposals and object classification in a unified network.

(Figure: FasterRCNN architecture)

Class Signature

class FasterRCNN(nn.Module):
    def __init__(
        self,
        backbone: nn.Module,
        feat_channels: int,
        num_classes: int,
        *,
        use_fpn: bool = False,
        anchor_sizes: tuple[int, ...] = (128, 256, 512),
        aspect_ratios: tuple[float, ...] = (0.5, 1.0, 2.0),
        anchor_stride: int = 16,
        pool_size: tuple[int, int] = (7, 7),
        hidden_dim: int = 4096,
        dropout: float = 0.5,
    )

Parameters

  • backbone (nn.Module): Feature extraction network applied once per image to produce a feature map.

  • feat_channels (int): Number of output channels from the backbone’s final feature map.

  • num_classes (int): Number of object categories (excluding background).

  • use_fpn (bool, optional): Whether to use a Feature Pyramid Network (FPN) for backbone feature extraction. Default is False. When set to True, the provided backbone must return FPN-style multi-scale feature maps.

  • anchor_sizes (tuple[int, …], optional): Set of anchor box scales used by the RPN. Default is (128, 256, 512). Together with aspect_ratios and anchor_stride, this determines how many anchors the RPN scores (see the sketch after this list).

  • aspect_ratios (tuple[float, …], optional): Set of aspect ratios for the anchors. Default is (0.5, 1.0, 2.0).

  • anchor_stride (int, optional): Stride of the anchor generation relative to the backbone feature map. Default is 16.

  • pool_size (tuple[int, int], optional): Spatial size of each RoI feature after pooling. Default is (7, 7).

  • hidden_dim (int, optional): Number of units in the fully connected head. Default is 4096.

  • dropout (float, optional): Dropout probability used in the classification and regression head. Default is 0.5.
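
Together, anchor_sizes, aspect_ratios, and anchor_stride fix how many candidate boxes the RPN evaluates. A quick sanity check with the defaults (plain arithmetic, not a library call):

anchor_sizes = (128, 256, 512)
aspect_ratios = (0.5, 1.0, 2.0)
anchor_stride = 16

# One anchor per (size, ratio) pair at each feature-map cell.
anchors_per_location = len(anchor_sizes) * len(aspect_ratios)

# A 512x512 input at stride 16 yields a 32x32 feature map, so the
# RPN scores 32 * 32 * 9 = 9216 anchors in total.
feat_h = feat_w = 512 // anchor_stride
total_anchors = feat_h * feat_w * anchors_per_location
print(anchors_per_location, total_anchors)  # 9 9216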

Architecture

Faster R-CNN enhances Fast R-CNN by replacing external proposal mechanisms with a learnable RPN:

  1. Feature Map Extraction:

    • The image is processed by the backbone to produce a dense feature map.

  2. Region Proposal Network (RPN):

    • Anchors are placed over the feature map.

    • The RPN classifies whether each anchor contains an object and regresses its bounding box (a conceptual sketch follows this list).

  3. RoI Pooling:

    • High-confidence proposals are selected and pooled to a fixed size (pool_size).

  4. Detection Head:

    • Each RoI is processed by fully connected layers for classification and bounding box refinement.

  5. Loss Output:

    • The model provides a .get_loss() method, returning a structured loss dictionary.
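
Conceptually, the RPN in step 2 is a small convolutional head slid over the backbone feature map. The sketch below is illustrative only; it reuses the nn.Conv2d and nn.ReLU modules from the examples further down and is not lucid's actual internal implementation:

import lucid.nn as nn

class TinyRPNHead(nn.Module):
    """Illustrative RPN head: objectness scores and box deltas per anchor."""

    def __init__(self, feat_channels: int, num_anchors: int = 9):
        super().__init__()
        self.conv = nn.Conv2d(feat_channels, 256, kernel_size=3, padding=1)
        self.relu = nn.ReLU()
        # Two objectness logits (object vs. background) per anchor...
        self.cls = nn.Conv2d(256, num_anchors * 2, kernel_size=1)
        # ...and four box-regression deltas per anchor.
        self.reg = nn.Conv2d(256, num_anchors * 4, kernel_size=1)

    def forward(self, x):
        h = self.relu(self.conv(x))
        return self.cls(h), self.reg(h)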

Loss Dictionary

class _FasterRCNNLoss(TypedDict):
    rpn_cls_loss: Tensor
    rpn_reg_loss: Tensor
    roi_cls_loss: Tensor
    roi_reg_loss: Tensor
    total_loss: Tensor

Returned by FasterRCNN.get_loss(), this dictionary provides detailed loss breakdowns for both RPN and RoI heads.
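
The total_loss entry is presumably the combined objective used for backpropagation, while the other four keys expose the RPN and RoI terms separately, which is convenient for logging. A hedged sketch (the get_loss arguments are omitted here because they depend on your target format):

losses = model.get_loss(...)  # pass your images and ground-truth targets here
for name in ("rpn_cls_loss", "rpn_reg_loss", "roi_cls_loss", "roi_reg_loss"):
    print(name, losses[name])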

Examples

Basic Usage

import lucid.models as models
import lucid.nn as nn
import lucid.random

class SimpleBackbone(nn.Module):
    """Toy two-conv backbone producing a 128-channel feature map."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.ReLU()
        )

    def forward(self, x):
        return self.conv(x)

backbone = SimpleBackbone()
model = models.FasterRCNN(backbone, feat_channels=128, num_classes=5)

image = lucid.random.randn(1, 3, 512, 512)
output = model.predict(image)
print(output["boxes"].shape, output["scores"].shape, output["labels"].shape)
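
The three outputs are expected to be aligned per detection: boxes of shape (N, 4) alongside scores and labels of length N, where N is the number of proposals retained by the model's post-processing.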

Custom Configuration

backbone = SimpleBackbone()
model = models.FasterRCNN(
    backbone=backbone,
    feat_channels=128,
    num_classes=5,
    anchor_sizes=(64, 128, 256),
    aspect_ratios=(0.5, 1.0),
    pool_size=(5, 5),
    hidden_dim=2048,
    dropout=0.4,
)

image = lucid.random.randn(1, 3, 384, 384)
output = model.predict(image)
print(output["boxes"].shape, output["scores"].shape, output["labels"].shape)

Tip

For training, use model.get_loss() with predictions and ground-truth targets to compute the total and component-wise loss terms.
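
A minimal training-step sketch, assuming get_loss accepts the input image together with ground-truth boxes and labels (the argument names below are hypothetical; check the actual signature):

# Hypothetical call; match the real get_loss signature in your version.
losses = model.get_loss(image, gt_boxes, gt_labels)
losses["total_loss"].backward()  # backpropagate the combined objective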