
You Only Look Once: Unified, Real-Time Object Detection

Introducing YOLO, a unified, real-time object detection system that frames object detection as a single regression problem.

Joseph Redmon, Santosh Divvala, et al. | 15 min read | Original Paper | Object Detection · Computer Vision · Deep Learning

TL;DR

YOLO recasts object detection as a single regression problem. Instead of the multi-stage propose-then-classify pipeline used by R-CNN and its variants, a single convolutional network predicts bounding box coordinates and class probabilities directly from the full image in one forward pass. The base model runs at 45 FPS with 63.4 mAP on VOC 2007, roughly 90x faster than Fast R-CNN at the cost of several points of mAP. The speed comes from eliminating region proposals entirely: the network reasons globally over the image, trading some localization precision for a dramatic reduction in inference cost.

The Core Idea: Detection as Regression

Prior to YOLO, dominant detectors like R-CNN, Fast R-CNN, and Faster R-CNN operated in stages: generate region proposals, extract features from each, then classify and refine. Each stage introduced latency and complexity. Deformable Parts Models (DPM) used sliding windows with hand-crafted features, which was even slower.

YOLO takes a fundamentally different approach. The input image is divided into an S × S grid (with S = 7 in the paper). Each grid cell predicts B bounding boxes (with B = 2) and C class probabilities (with C = 20 for VOC). Each bounding box prediction consists of 5 values: center coordinates (x, y) relative to the grid cell, width and height (w, h) relative to the full image, and a confidence score reflecting both objectness and localization quality.
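As a sketch of this parameterization, the following snippet (names and coordinate conventions are assumptions for illustration; the paper does not prescribe an implementation) maps a ground-truth box in pixel coordinates to its responsible cell and normalized targets:

```python
S = 7        # grid size
IMG = 448    # network input resolution in pixels

def encode_box(x_min, y_min, x_max, y_max):
    """Map a ground-truth box (pixel coords) to YOLO's target values.

    Returns the responsible grid cell (row, col), the (x, y) center offsets
    relative to that cell, and (w, h) relative to the full image.
    """
    cx = (x_min + x_max) / 2        # box center in pixels
    cy = (y_min + y_max) / 2
    cell = IMG / S                  # cell size: 448 / 7 = 64 px
    col = int(cx // cell)           # the cell that "owns" this object
    row = int(cy // cell)
    x = (cx - col * cell) / cell    # center offset within the cell, in [0, 1)
    y = (cy - row * cell) / cell
    w = (x_max - x_min) / IMG       # width/height relative to the full image
    h = (y_max - y_min) / IMG
    return row, col, x, y, w, h

# A 100 x 200 box centered at (224, 224) lands in the center cell (3, 3):
print(encode_box(174, 124, 274, 324))
```

Only the cell containing the box center is responsible for the detection; all other cells treat this object as background.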

The confidence score is defined as:

\text{Confidence} = \Pr(\text{Object}) \times \text{IoU}^{\text{truth}}_{\text{pred}}

This means the network output is a single tensor of shape S × S × (B × 5 + C), which for the VOC configuration is 7 × 7 × 30. The entire detection pipeline — feature extraction, bounding box prediction, and classification — collapses into one forward pass through a single network.
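A minimal sketch of how one cell's 30-value slice could be unpacked; the exact memory layout (boxes first, then class probabilities) is an assumption for illustration:

```python
S, B, C = 7, 2, 20

def decode_cell(pred):
    """Split one cell's (B * 5 + C)-dim prediction into B boxes and C class probs.

    Assumed layout: B blocks of (x, y, w, h, confidence), then C class
    probabilities shared by all boxes in the cell.
    """
    boxes = [tuple(pred[b * 5:(b + 1) * 5]) for b in range(B)]
    class_probs = pred[B * 5:]   # Pr(Class_i | Object)
    return boxes, class_probs

# The full output tensor has S * S * (B * 5 + C) = 7 * 7 * 30 = 1470 values.
print(S * S * (B * 5 + C))

cell = [0.5, 0.5, 0.2, 0.3, 0.9,   # box 1: x, y, w, h, confidence
        0.4, 0.6, 0.1, 0.1, 0.2,   # box 2
        ] + [0.05] * 20            # 20 class probabilities (uniform here)
boxes, class_probs = decode_cell(cell)
```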

Architecture: 24 Conv Layers + 2 Fully Connected

The YOLO network is inspired by GoogLeNet but replaces Inception modules with simple reduction layers. The architecture consists of 24 convolutional layers for feature extraction followed by 2 fully connected layers for prediction. The first 20 convolutional layers are pretrained on ImageNet at half resolution (224 × 224), then the full network is fine-tuned on detection at 448 × 448.

The convolutional layers use alternating 1 × 1 reduction layers and 3 × 3 convolutional layers. The final output of the FC layers is reshaped into the 7 × 7 × 30 prediction tensor. Leaky ReLU activation (α = 0.1) is used throughout except for the final layer, which uses linear activation.
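A quick sketch of the activation, together with a sanity check that the 448-pixel input and the 7 × 7 output grid imply a total spatial downsampling factor of 64 = 2^6 (the function name is assumed):

```python
def leaky_relu(x, alpha=0.1):
    """Leaky ReLU as used in YOLOv1: identity for positive inputs,
    a small slope alpha for negative ones."""
    return x if x > 0 else alpha * x

# 448-pixel input -> 7 x 7 grid is a 64x reduction, i.e. six halvings.
assert 448 // 2 ** 6 == 7

print(leaky_relu(3.0), leaky_relu(-3.0))
```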

The paper also introduces Fast YOLO, a smaller variant with only 9 convolutional layers and fewer filters per layer. Fast YOLO achieves 155 FPS while still reaching 52.7 mAP on VOC 2007 — demonstrating that the single-regression framework scales down gracefully.

Grid-Based Prediction

The grid design encodes a strong spatial prior: each grid cell is responsible for detecting objects whose center falls within it. This means the cell at row i, column j predicts boxes only for objects centered in that cell. At test time, the class-specific confidence for each box is:

\Pr(\text{Class}_i \mid \text{Object}) \times \Pr(\text{Object}) \times \text{IoU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) \times \text{IoU}^{\text{truth}}_{\text{pred}}

This yields per-box, per-class scores. After thresholding, non-maximum suppression (NMS) removes duplicate detections. With S = 7 and B = 2, YOLO produces 7 × 7 × 2 = 98 bounding box predictions per image, more than an order of magnitude fewer than the ~2000 proposals from Selective Search used by R-CNN.
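A small sketch of this scoring step (the helper name is assumed): each box's confidence, which already encodes Pr(Object) · IoU, is multiplied by the cell's conditional class probabilities.

```python
def class_specific_scores(box_confidence, class_probs):
    """Per-class detection score for one box:
    Pr(Class_i | Object) * Pr(Object) * IoU = class_prob * box_confidence."""
    return [box_confidence * p for p in class_probs]

S, B = 7, 2
num_boxes = S * S * B          # 98 candidate boxes per image
scores = class_specific_scores(0.9, [0.7, 0.2, 0.1])
print(num_boxes, scores)
```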

The Multi-Part Loss Function

YOLO is trained end-to-end with a single sum-of-squared-errors loss that combines localization, confidence, and classification terms:

\begin{aligned}
\mathcal{L} &= \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} (C_i - \hat{C}_i)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} (C_i - \hat{C}_i)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} (p_i(c) - \hat{p}_i(c))^2
\end{aligned}

Three design choices in this loss are worth noting:

  1. Square-root width/height: The loss uses √(w) and √(h) rather than raw dimensions. This reflects the intuition that small deviations in large boxes matter less than in small boxes — a 10-pixel error on a 200-pixel box is less severe than on a 20-pixel box.

  2. Weighted terms: λ_coord = 5 upweights the localization loss, while λ_noobj = 0.5 downweights the confidence loss for cells without objects. This is necessary because most grid cells contain no object, and without the weighting, the gradient signal from empty cells would overwhelm the useful signal from cells containing objects.

  3. Indicator functions: 𝟙_ij^obj is 1 when the j-th box predictor in cell i is "responsible" for an object (i.e., has the highest IoU with the ground truth among the B predictors in that cell).
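The individual loss terms can be sketched in plain Python (a simplified per-cell view under assumed names; responsible-predictor selection by IoU is omitted):

```python
import math

LAMBDA_COORD, LAMBDA_NOOBJ = 5.0, 0.5

def coord_loss(pred, truth):
    """Localization terms for a responsible predictor: squared error on the
    center plus squared error on the square-rooted width and height."""
    (x, y, w, h), (xt, yt, wt, ht) = pred, truth
    xy = (x - xt) ** 2 + (y - yt) ** 2
    wh = (math.sqrt(w) - math.sqrt(wt)) ** 2 + (math.sqrt(h) - math.sqrt(ht)) ** 2
    return LAMBDA_COORD * (xy + wh)

def confidence_loss(c_pred, c_truth, has_object):
    """Confidence term; empty cells are down-weighted by lambda_noobj."""
    weight = 1.0 if has_object else LAMBDA_NOOBJ
    return weight * (c_pred - c_truth) ** 2

def class_loss(p_pred, p_truth):
    """Classification term, applied only to cells that contain an object."""
    return sum((p - t) ** 2 for p, t in zip(p_pred, p_truth))

# The sqrt encoding at work: the same 10-pixel width error (out of 448)
# costs more on a small box than on a large one.
big   = coord_loss((0.5, 0.5, 210/448, 0.5), (0.5, 0.5, 200/448, 0.5))
small = coord_loss((0.5, 0.5, 30/448, 0.5), (0.5, 0.5, 20/448, 0.5))
assert small > big
```

The final two lines make design choice 1 concrete: with raw widths both errors would be identical, but under the square root the small-box error dominates.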

Non-Maximum Suppression

After the network produces 98 scored bounding boxes, NMS removes duplicates. Boxes are sorted by confidence, and for each class, any box with IoU exceeding a threshold (typically 0.5) relative to a higher-confidence box is suppressed. This post-processing step adds 2–3 mAP on VOC.
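A minimal greedy NMS along these lines (the paper does not spell out the procedure in code, so names and box format are assumptions):

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop any
    remaining box overlapping it by more than `thresh` IoU."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thresh]
    return keep

# Two near-duplicates and one distant box: the duplicate is suppressed.
kept = nms([(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)],
           [0.9, 0.8, 0.7])
print(kept)
```

In YOLO this runs per class over the 98 scored boxes, so a box suppressed for one class can still survive for another.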

Key Results

On PASCAL VOC 2007, YOLO achieves 63.4 mAP at 45 FPS. For context, Fast R-CNN reaches 70.0 mAP but at only 0.5 FPS, and Faster R-CNN achieves 73.2 mAP at 7 FPS. YOLO trades roughly 10 points of mAP for a 6–90x speedup depending on the baseline.

When combined with Fast R-CNN as an ensemble, accuracy jumps to 75.0 mAP, a 3.2 point boost over the best single Fast R-CNN model (71.8 mAP). This gain comes because YOLO makes far fewer background false positives (due to its global reasoning), complementing R-CNN's stronger localization.

On generalization, YOLO significantly outperforms DPM and R-CNN when transferring from natural images to artwork (Picasso and People-Art datasets). The gap is substantial: YOLO's mAP drops much less when the domain shifts, indicating that it learns more generalizable object representations rather than overfitting to photographic textures.

Comparison with the R-CNN Family

| Model | mAP (VOC 2007) | FPS | Region Proposals | End-to-End |
|---|---|---|---|---|
| R-CNN | 66.0 | ~0.05 | Selective Search | No |
| Fast R-CNN | 70.0 | ~0.5 | Selective Search | Partial |
| Faster R-CNN | 73.2 | ~7 | RPN | Partial |
| YOLO | 63.4 | 45 | None | Yes |
| Fast YOLO | 52.7 | 155 | None | Yes |

The fundamental trade-off is clear: YOLO sacrifices localization accuracy (especially on small objects) for throughput. Two-stage detectors use region proposals to focus computation on likely object locations, achieving better per-box accuracy. YOLO compensates with global reasoning — it sees the full image context when making predictions, which reduces background confusion errors by nearly half compared to Fast R-CNN.

Critical Analysis

Strengths:

  • Speed through architectural simplicity. One forward pass replaces multi-stage pipelines. The 45 FPS throughput enabled real-time detection applications that were previously impractical.
  • Global reasoning. Because the network sees the entire image during prediction, it encodes contextual information that region-based methods miss. This reduces background false positives and improves domain transfer.
  • Clean training. End-to-end optimization with a single loss function eliminates the need to separately tune proposal generation, feature extraction, and classification stages.

Limitations:

  • Struggles with small objects. The coarse 7 × 7 grid limits spatial resolution. Each cell predicts only 2 boxes, so densely packed small objects (e.g., a flock of birds) are systematically missed.
  • Grid constraints impose a hard ceiling. With S² × B = 98 total predictions, the network cannot detect more than 98 objects. More critically, each cell predicts only one set of class probabilities, so when nearby objects of different classes share a cell, at most one can be detected.
  • Localization errors dominate. The paper's error analysis shows that YOLO's primary failure mode is imprecise bounding boxes, not misclassification. The sum-of-squared-errors loss treats localization errors equally across box sizes despite the square-root mitigation.
  • Aspect ratio sensitivity. The network is trained on a fixed 448 × 448 input, and the FC layers bake in spatial assumptions. Objects with unusual aspect ratios or at uncommon scales receive less training signal.

The YOLO Lineage

YOLO spawned one of the most prolific lineages in object detection:

  • YOLOv2 / YOLO9000 (Redmon & Farhadi, 2017) — introduced batch normalization, anchor boxes, multi-scale training, and WordTree for detecting 9000+ categories. Replaced FC layers with fully convolutional architecture.
  • YOLOv3 (Redmon & Farhadi, 2018) — adopted feature pyramid networks for multi-scale detection, switched to logistic classifiers for multi-label prediction, and used Darknet-53 as the backbone.
  • YOLOv4 (Bochkovskiy et al., 2020) — systematic integration of bag-of-freebies (data augmentation, label smoothing) and bag-of-specials (SPP, PAN) for optimized training.
  • YOLOv5–v11 (Ultralytics, various authors, 2020–2024) — continued iterations on architecture, training recipes, and deployment optimization. Later versions adopted anchor-free detection and transformer-based attention mechanisms.

Each version addressed specific YOLOv1 limitations — anchor boxes for aspect ratio diversity, feature pyramids for small objects, fully convolutional architectures for variable input sizes — while preserving the core single-pass philosophy.

Impact and Legacy

YOLOv1 did not achieve the highest accuracy on any benchmark. Its lasting contribution was demonstrating that object detection could be reframed as a direct regression problem without sacrificing practical utility. This insight — that a single-stage detector could achieve real-time performance with acceptable accuracy — shifted the field's trajectory.

The paper directly motivated SSD (Liu et al., 2016), which combined YOLO's single-pass approach with multi-scale feature maps. RetinaNet (Lin et al., 2017) later showed that the accuracy gap between one-stage and two-stage detectors was primarily due to class imbalance during training, not architectural limitations, and closed it with focal loss. Modern anchor-free detectors like FCOS and CenterNet further simplified the YOLO paradigm by eliminating predefined anchor boxes altogether.

Beyond research, YOLO popularized real-time object detection in industry. Autonomous driving perception stacks, video surveillance systems, and edge-device inference pipelines all trace design lineage back to the principle that detection speed and accuracy need not be fundamentally at odds.

Related Papers

  • Faster R-CNN — the two-stage detector that YOLO was benchmarked against, introducing Region Proposal Networks
  • DETR — end-to-end object detection with transformers, eliminating NMS and anchors entirely
  • Deep Residual Learning — ResNets that became standard backbones for later YOLO versions
  • Vision Transformer — the ViT architecture that influenced modern detection backbones

If you found this paper review helpful, consider sharing it with others.
