YOLO-v3

ConvNet · One-Stage Detector · Object Detection

class lucid.models.YOLO_V3(num_classes: int, anchors: list[tuple[float, float]] | None = None, image_size: int = 416, darknet: Module | None = None, darknet_out_channels_arr: list[int] | None = None)

The YOLO_V3 class implements the YOLO-v3 object detection model, extending YOLO-v2 with multi-scale feature maps, residual connections, and a deeper backbone (Darknet-53).

YOLO-v3 architecture

Class Signature

class YOLO_V3(
    num_classes: int,
    anchors: list[tuple[float, float]] | None = None,
    image_size: int = 416,
    darknet: nn.Module | None = None,
    darknet_out_channels_arr: list[int] | None = None,
)

Parameters

  • num_classes (int): Number of object classes for detection.

  • anchors (list[tuple[float, float]], optional): List of predefined anchor box sizes. If None, the model uses the default 9 YOLO-v3 anchors.

  • darknet (nn.Module, optional): Optional custom Darknet-53-style backbone. If not provided, the model uses the default one.

    Important

    To pre-train Darknet-53 on a classification task, set classification=True in the forward pass of YOLO_V3.darknet. This returns classification logits rather than the multi-scale feature maps used for detection (see the sketch after this parameter list).

  • image_size (int): Size of the input image (default is 416).

  • darknet_out_channels_arr (list[int], optional): Output channel counts of the multi-scale feature maps produced by a custom darknet backbone; only relevant when darknet is provided.
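
Below is a minimal sketch of the pre-training path noted above, assuming the backbone is callable like any other nn.Module; the batch shape and resolution are illustrative:

>>> import lucid
>>> from lucid.models import YOLO_V3
>>> model = YOLO_V3(num_classes=80)
>>> x = lucid.random.rand(8, 3, 416, 416)  # illustrative pre-training batch
>>> logits = model.darknet(x, classification=True)  # classification logits, not feature maps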

Attributes

  • darknet (nn.Module): Feature extraction backbone, typically Darknet-53.

Methods

YOLO_V3.forward(x: Tensor) -> tuple[Tensor]
YOLO_V3.get_loss(x: Tensor, target: tuple[Tensor]) -> Tensor
YOLO_V3.predict(x: Tensor, conf_thresh: float = 0.5, iou_thresh: float = 0.5) -> list[list[DetectionDict]]

Multi-Scale Detection

YOLO-v3 performs detection at three different scales, targeting small, medium, and large objects by upsampling and concatenating intermediate features:

  • 13x13 grid (stride=32): large objects

  • 26x26 grid (stride=16): medium objects

  • 52x52 grid (stride=8): small objects

Each detection head predicts 3 bounding boxes per grid cell, with each scale using its own subset of 3 anchors.
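
The grid sizes follow directly from the strides; for the default 416 input:

>>> image_size = 416
>>> [image_size // stride for stride in (32, 16, 8)]
[13, 26, 52]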

Input Format

The target should be a tuple of 3 elements (one for each scale), each with the following shape (a worked encoding sketch closes this section):

(N, Hs, Ws, B * (5 + C))

Where:

  • Hs, Ws are the grid height and width of the scale (13, 26, or 52 for a 416 input),

  • B is the number of anchors (typically 3 per scale),

  • C is the number of classes.

Each vector at \((i, j)\) of shape \((B * (5 + C))\) contains:

  • For each box \(b\): \((t_x, t_y, t_w, t_h, obj, cls_1, \cdots, cls_C)\)

Where:

  • \(t_x, t_y\): offset of box center within the cell (\(\in [0,1]\))

  • \(t_w, t_h\): log-scale box size relative to the anchor, i.e. \(t_w = \log\left((g_w / s) / a_w\right)\) where \(g_w\) is the ground-truth width in pixels, \(s\) the stride, and \(a_w\) the anchor width in grid units (canonical YOLO-v3 form)

  • \(obj\): 1 if anchor is responsible for object, else 0

  • \(cls_{1 \ldots C}\): one-hot class vector
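
A NumPy sketch of this encoding (not part of lucid), placing a single ground-truth box into the 13x13 target of a 416-input model; the box, anchor, and class values are made up for illustration:

import numpy as np

C, B, S, stride = 80, 3, 13, 32
target = np.zeros((1, S, S, B * (5 + C)), dtype=np.float32)

# Assumed ground-truth box (center x, center y, width, height) in pixels,
# and its matched anchor (aw, ah) in grid units -- both illustrative.
gx, gy, gw, gh = 200.0, 150.0, 120.0, 80.0
aw, ah = 3.625, 2.8125
cls_id, b = 17, 1  # class index and responsible anchor slot

i, j = int(gy // stride), int(gx // stride)            # grid cell (row, col)
off = b * (5 + C)
target[0, i, j, off + 0] = gx / stride - j             # t_x in [0, 1]
target[0, i, j, off + 1] = gy / stride - i             # t_y in [0, 1]
target[0, i, j, off + 2] = np.log((gw / stride) / aw)  # t_w, log-ratio
target[0, i, j, off + 3] = np.log((gh / stride) / ah)  # t_h, log-ratio
target[0, i, j, off + 4] = 1.0                         # objectness
target[0, i, j, off + 5 + cls_id] = 1.0                # one-hot class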

YOLO-v3 Loss

YOLO-v3 uses an anchor-based, multi-part loss across the three scales. Each scale contributes coordinate, objectness, and classification terms to the final loss.

\[\begin{split}\begin{aligned} \mathcal{L} &= \sum_{i,j,b} \mathbb{1}_{ijb}^{obj} \alpha_{ijb} \left[ (\sigma(\hat{t}_{x,ijb}) - t_{x,ijb})^2 + (\sigma(\hat{t}_{y,ijb}) - t_{y,ijb})^2 + (\hat{t}_{w,ijb} - t_{w,ijb})^2 + (\hat{t}_{h,ijb} - t_{h,ijb})^2 \right] \\ &\quad+ \sum_{i,j,b} \left[ \mathbb{1}_{ijb}^{obj}(\hat{C}_{ijb} - 1)^2 + \mathbb{1}_{ijb}^{noobj}\hat{C}_{ijb}^2 \right] \\ &\quad+ \sum_{i,j,b} \mathbb{1}_{ijb}^{obj} \sum_c \text{BCE}(\hat{p}_{ijb}(c), p_{ijb}(c)) \end{aligned}\end{split}\]

Where:

  • \(\hat{t}_{x,y,w,h}\) are raw outputs

  • \(t_{x,y}\) are cell-relative offsets

  • \(t_{w,h}\) are log-ratio targets (canonical encoding)

  • \(\hat{C}\) is objectness after sigmoid

  • \(\hat{p}(c)\) is predicted class prob after sigmoid
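
A NumPy sketch of the single-scale term, assuming pred carries the raw outputs in the same (N, Hs, Ws, B * (5 + C)) layout as the target and treating the box weight \(\alpha\) as a scalar; lucid's actual get_loss may differ in details:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(p, t, eps=1e-7):
    p = np.clip(p, eps, 1.0 - eps)
    return -(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))

def yolo_v3_loss_single_scale(pred, target, C, alpha=1.0):
    # pred, target: (N, Hs, Ws, B * (5 + C)); pred holds raw outputs,
    # target uses the encoding from "Input Format" above.
    N, H, W, D = pred.shape
    B = D // (5 + C)
    pred = pred.reshape(N, H, W, B, 5 + C)
    tgt = target.reshape(N, H, W, B, 5 + C)

    obj = tgt[..., 4]      # 1 where the anchor is responsible
    noobj = 1.0 - obj

    # Coordinate term: sigmoid on center offsets, raw log-sizes
    xy = (sigmoid(pred[..., 0:2]) - tgt[..., 0:2]) ** 2
    wh = (pred[..., 2:4] - tgt[..., 2:4]) ** 2
    coord = alpha * obj * (xy.sum(-1) + wh.sum(-1))

    # Objectness term: squared error on sigmoid confidence
    c_hat = sigmoid(pred[..., 4])
    conf = obj * (c_hat - 1.0) ** 2 + noobj * c_hat ** 2

    # Class term: per-class BCE, responsible anchors only
    p_hat = sigmoid(pred[..., 5:])
    cls = obj * bce(p_hat, tgt[..., 5:]).sum(-1)

    return (coord + conf + cls).sum()

The total loss is then the sum of this term over the three scales.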

Prediction Output

The predict method applies decoding, confidence thresholding, and non-maximum suppression (NMS). It returns a list of detections per image, where each detection is a dictionary:

  • "box": Tensor [x1, y1, x2, y2] in image pixels

  • "score": objectness * class probability

  • "class_id": predicted class index
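
For reference, a class-agnostic NMS sketch in NumPy; it illustrates the suppression step conceptually and is not lucid's internal implementation:

import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    # boxes: (M, 4) as [x1, y1, x2, y2]; scores: (M,); returns kept indices.
    order = scores.argsort()[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        rest = order[1:]
        # Intersection of the top box with all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]  # drop overlapping boxes
    return keep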

Example Usage

Using YOLO-V3 with default anchors and backbone

>>> import lucid
>>> from lucid.models import YOLO_V3
>>> model = YOLO_V3(num_classes=80)
>>> x = lucid.random.rand(2, 3, 416, 416)
>>> preds = model.predict(x)
>>> print(preds[0][0])

Backward Propagation

The YOLO-V3 model supports gradient backpropagation through all layers:

>>> x = lucid.random.rand(1, 3, 416, 416, requires_grad=True)
>>> targets = (...)  # 3-scale tuple of target tensors
>>> loss = model.get_loss(x, targets)
>>> loss.backward()
>>> print(x.grad)