TL;DR
DETR reframes object detection as a direct set prediction problem. Instead of the proposal-then-classify pipeline used by Faster R-CNN and its descendants — with anchor boxes, non-maximum suppression, and hand-tuned post-processing — DETR feeds image features through a transformer encoder-decoder and produces a fixed-size set of predictions in a single pass. A bipartite matching loss, with the optimal assignment computed by the Hungarian algorithm, matches predictions to ground truth during training, enforcing one-to-one correspondence without any duplicate suppression heuristics. The result matches Faster R-CNN on COCO with a dramatically simpler pipeline, though at the cost of slow training convergence and weaker performance on small objects.
The Core Idea: Detection as Set Prediction
Traditional object detectors generate thousands of overlapping candidate boxes, score each one, then apply non-maximum suppression (NMS) to remove duplicates. Every stage involves hand-designed rules: anchor aspect ratios, IoU thresholds for positive/negative assignment, NMS overlap thresholds. These components are effective but brittle — performance is sensitive to their tuning, and they introduce non-differentiable steps into an otherwise learnable pipeline.
DETR sidesteps all of this by treating detection as a set prediction problem. The model outputs a fixed set of N predictions (where N is chosen to be larger than any expected number of objects in an image, typically 100). Each prediction is either a bounding box with a class label or a special "no object" (∅) token. During training, the Hungarian algorithm finds the optimal one-to-one assignment between predictions and ground truth, and the loss is computed only on matched pairs.
This formulation has two key properties: (1) each ground truth object is matched to exactly one prediction, so duplicates cannot occur by construction, and (2) the entire pipeline is differentiable end-to-end (the Hungarian algorithm runs only for loss computation, not in the forward pass).
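The one-to-one property can be seen directly with SciPy's `linear_sum_assignment` (an exact solver for the same bipartite matching problem the Hungarian algorithm solves), here on a made-up toy cost matrix:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix (made-up numbers): 4 predictions (rows) vs. 2 ground-truth
# objects (columns); lower cost = better class/box agreement.
cost = np.array([
    [0.9, 0.1],   # prediction 0 fits GT 1
    [0.2, 0.8],   # prediction 1 fits GT 0
    [0.7, 0.6],
    [0.5, 0.4],
])

pred_idx, gt_idx = linear_sum_assignment(cost)
# Optimal pairing: prediction 0 <-> GT 1, prediction 1 <-> GT 0.
# Predictions 2 and 3 stay unmatched and are supervised toward "no object".
```

Because each ground-truth column is used exactly once, duplicates are impossible by construction, exactly as described above.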
Architecture: Backbone + Transformer + FFN
DETR's architecture has three components, each with a clear role.
CNN Backbone. A ResNet (typically ResNet-50) extracts a feature map from the input image. For a 3 × H × W input, the backbone produces a feature map f ∈ ℝ^(C × H' × W') where C = 2048 and the spatial resolution is reduced by a factor of 32. A 1 × 1 convolution projects this to a lower dimension d (256 in the paper). The resulting d × H' × W' tensor is flattened into a sequence of H'W' tokens, each of dimension d, and augmented with fixed sinusoidal positional encodings that encode spatial location.
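A NumPy sketch of the flatten-and-encode step, with hypothetical dimensions (25 × 33, roughly an 800 × 1066 input after 32× downsampling). The per-axis sine/cosine scheme below is the standard transformer encoding applied to rows and columns; DETR's actual implementation differs in details such as normalization:

```python
import numpy as np

# Hypothetical post-projection feature map: d = 256 channels, 25 x 33 spatial.
d, Hp, Wp = 256, 25, 33
feat = np.random.default_rng(0).standard_normal((d, Hp, Wp))

tokens = feat.reshape(d, Hp * Wp).T          # (H'W', d) sequence for the encoder

def sinusoid(positions, num_channels):
    # Standard sine/cosine encoding over num_channels dimensions.
    freqs = 10000.0 ** (-np.arange(0, num_channels, 2) / num_channels)
    angles = positions[:, None] * freqs[None, :]
    enc = np.empty((len(positions), num_channels))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

# 2D encoding: half the channels encode the row index, half the column index.
ys, xs = np.meshgrid(np.arange(Hp), np.arange(Wp), indexing="ij")
pos = np.concatenate(
    [sinusoid(ys.ravel(), d // 2), sinusoid(xs.ravel(), d // 2)], axis=1
)                                            # (H'W', d)

encoder_input = tokens + pos                 # what the transformer encoder sees
```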
Transformer Encoder. The flattened feature sequence passes through a standard transformer encoder (6 layers). Self-attention across all spatial positions allows the encoder to reason about global context — for instance, disambiguating overlapping objects or reasoning about relative scale. The output is a globally-refined feature sequence of the same shape.
Transformer Decoder. The decoder takes N learned object queries as input and attends to the encoded image features via cross-attention. Each object query is a learned d-dimensional embedding that the model trains to specialize for detecting objects in particular spatial regions, scales, or categories. The decoder applies self-attention across queries (so they can coordinate to avoid duplicates) followed by cross-attention to the encoder output. After 6 decoder layers, each query has aggregated the image information it needs.
Prediction Heads (FFN). Two small feed-forward networks operate independently on each decoder output. One predicts the class label (including ∅ for "no object"), and the other predicts a normalized bounding box as a 4-tuple (cx, cy, w, h) representing center coordinates, width, and height relative to the image size.
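The shapes involved can be sketched in a few lines, with random weights standing in for trained parameters (all sizes follow the text: N = 100 queries, d = 256, 91 COCO classes plus one "no object" logit; the box head mirrors the paper's 3-layer MLP):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, num_classes = 100, 256, 91            # queries, model width, COCO classes

decoder_out = rng.standard_normal((N, d))   # one vector per object query

# Class head: a single linear layer to num_classes + 1 logits
# (the extra logit is the "no object" class).
W_cls = 0.02 * rng.standard_normal((d, num_classes + 1))
logits = decoder_out @ W_cls                # (N, 92)

# Box head: 3-layer MLP; sigmoid keeps (cx, cy, w, h) inside [0, 1].
W1 = 0.02 * rng.standard_normal((d, d))
W2 = 0.02 * rng.standard_normal((d, d))
W3 = 0.02 * rng.standard_normal((d, 4))
h = np.maximum(decoder_out @ W1, 0)         # ReLU
h = np.maximum(h @ W2, 0)
boxes = 1.0 / (1.0 + np.exp(-(h @ W3)))     # (N, 4), normalized coordinates
```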
Hungarian Matching: The Training Signal
The central technical challenge is defining a loss for unordered set prediction. With N predictions and M ground truth objects (where M ≪ N), we need to decide which prediction is responsible for which ground truth before we can compute any loss.
DETR solves this with bipartite matching. Let σ range over permutations of N elements. The optimal assignment σ̂ minimizes the total matching cost:

$$\hat{\sigma} = \underset{\sigma \in \mathfrak{S}_N}{\arg\min} \sum_{i=1}^{N} \mathcal{L}_{\text{match}}\left(y_i, \hat{y}_{\sigma(i)}\right)$$
where yi is the i-th ground truth (padded with ∅ to length N) and ŷσ(i) is the prediction assigned to it. The pairwise matching cost for a ground truth object with class ci and box bi is:

$$\mathcal{L}_{\text{match}}\left(y_i, \hat{y}_{\sigma(i)}\right) = -\mathbb{1}_{\{c_i \neq \varnothing\}}\, \hat{p}_{\sigma(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\text{box}}\left(b_i, \hat{b}_{\sigma(i)}\right)$$
The first term rewards high predicted probability for the correct class. The second term penalizes box misalignment. This is solved exactly in O(N³) time by the Hungarian algorithm — fast enough since N = 100.
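A sketch of the cost-matrix construction under these definitions, with random stand-ins for the model outputs. The GIoU term of the full matching cost is omitted for brevity, and the λ weight and class labels are illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
N, M, num_classes = 100, 3, 91          # 100 queries, 3 GT objects, COCO classes

# Stand-ins for model outputs: softmax class probabilities and normalized boxes.
probs = rng.dirichlet(np.ones(num_classes + 1), size=N)  # (N, classes + 1)
pred_boxes = rng.uniform(size=(N, 4))                    # (cx, cy, w, h)

gt_classes = np.array([17, 0, 56])                       # hypothetical labels
gt_boxes = rng.uniform(size=(M, 4))

# Pairwise cost: -p(correct class) + lambda_L1 * ||b - b_hat||_1.
l1 = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)  # (N, M)
cost = -probs[:, gt_classes] + 5.0 * l1                             # (N, M)

pred_idx, gt_idx = linear_sum_assignment(cost)
# Each of the M ground-truth objects gets exactly one of the N queries;
# the remaining N - M queries are trained to predict "no object".
```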
Once the optimal assignment σ̂ is found, the actual training loss is computed over all matched pairs:

$$\mathcal{L}_{\text{Hungarian}}(y, \hat{y}) = \sum_{i=1}^{N} \left[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\text{box}}\left(b_i, \hat{b}_{\hat{\sigma}(i)}\right) \right]$$
The box loss combines an L1 term with a generalized IoU (GIoU) loss for scale-invariant regression:

$$\mathcal{L}_{\text{box}}\left(b_i, \hat{b}_{\sigma(i)}\right) = \lambda_{\text{iou}}\, \mathcal{L}_{\text{iou}}\left(b_i, \hat{b}_{\sigma(i)}\right) + \lambda_{\text{L1}}\, \left\lVert b_i - \hat{b}_{\sigma(i)} \right\rVert_1$$
GIoU is important because L1 loss alone penalizes small and large box errors equally in absolute terms, while IoU-based losses are scale-invariant.
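A minimal, self-contained GIoU for two axis-aligned boxes in corner format (x1, y1, x2, y2); the conversion from DETR's (cx, cy, w, h) outputs is omitted:

```python
def giou(box_a, box_b):
    """Generalized IoU: IoU minus the fraction of the smallest enclosing
    box C not covered by the union of the two boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    c_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (c_area - union) / c_area

# Identical boxes score 1.0; disjoint boxes go negative (toward -1 as they
# separate), so there is a gradient signal even with zero overlap.
```

The negative score on disjoint boxes is the practical payoff: plain IoU is flat at zero there, while GIoU still tells the regressor which direction to move.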
Object Queries and Parallel Decoding
A distinctive feature of DETR is its parallel decoding via object queries. Unlike autoregressive detectors that predict objects sequentially, DETR predicts all N objects simultaneously in a single decoder forward pass. Each object query attends independently to the full image, and queries coordinate through self-attention.
In practice, the learned object queries develop spatial specialization. Visualization of trained queries shows that individual queries tend to focus on specific image regions and object sizes — one query might specialize in large objects near the image center, while another handles small objects in the bottom-left corner. This is an emergent behavior; the queries are randomly initialized and learn their specializations entirely from data.
The parallel decoding is a significant architectural advantage. Autoregressive approaches (like early versions of Pix2Seq) must predict objects in some arbitrary order, which introduces sequential dependencies and slows inference. DETR's parallel formulation makes inference time independent of the number of objects.
Key Results
On COCO 2017, DETR with a ResNet-50 backbone achieves 42.0 AP, matching Faster R-CNN with the same backbone (42.0 AP) after extensive hyperparameter tuning of the latter. With a dilated ResNet-50 backbone (DC5), which doubles the feature resolution at the last stage, DETR reaches 43.3 AP.
However, the AP breakdown by object size reveals a clear pattern:
- Large objects (AP_L): DETR significantly outperforms Faster R-CNN (61.1 vs 57.0), likely because the transformer's global self-attention captures long-range context that region-based methods miss.
- Small objects (AP_S): DETR underperforms (20.5 vs 24.4). The 32x spatial downsampling in the backbone discards fine-grained information that small object detection requires. Feature pyramid networks (FPN), standard in Faster R-CNN variants, are absent from the original DETR.
DETR also demonstrates strong panoptic segmentation results, achieving 46.0 PQ on COCO panoptic by simply adding a segmentation head on top of the decoder outputs — evidence that the architecture generalizes beyond bounding box detection.
Critical Analysis
Strengths.
- Eliminates NMS, anchor boxes, and all hand-designed post-processing. The detection pipeline reduces to backbone + transformer + FFN + Hungarian matching.
- End-to-end differentiable training with a clean set prediction loss. No surrogate objectives, no complex label assignment rules.
- Global reasoning via self-attention handles occlusion and context naturally. This directly explains the strong AP_L performance.
- The architecture generalizes to panoptic segmentation with minimal modification.
- Object queries provide a natural mechanism for parallel decoding with no ordering assumptions.
Limitations.
- Slow convergence. DETR requires 300 epochs on COCO (about 3 days on 16 V100 GPUs, and the best results use an even longer 500-epoch schedule), compared to the standard 36-epoch (3x) schedule for Faster R-CNN. The cross-attention mechanism must learn to attend to the right image regions from scratch, without the spatial priors that anchors provide.
- Weak on small objects. The single-scale feature map at 1/32 resolution cannot represent objects smaller than a few pixels in feature space. Multi-scale feature pyramids, standard in modern detectors, are not used.
- Fixed query count. The number of object queries N must be set as a hyperparameter and must exceed the maximum number of objects in any training image. Setting it too high wastes computation; too low causes missed detections.
- High memory cost. Self-attention over all spatial tokens is O((H'W')²) in memory, which limits the input resolution and makes high-resolution feature maps impractical.
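The quadratic cost is easy to check with back-of-the-envelope arithmetic; the helper below is hypothetical and counts only one fp32 attention score matrix (real usage is several times this per layer and head):

```python
# Self-attention over all spatial tokens builds an (H'W') x (H'W') score matrix.
def attn_matrix_bytes(height, width, stride=32):
    tokens = (height // stride) * (width // stride)
    return tokens * tokens * 4  # float32

base = attn_matrix_bytes(800, 1066)     # 825 tokens -> ~2.7 MB per matrix
doubled = attn_matrix_bytes(1600, 2132)
# Doubling the input resolution quadruples the token count,
# so each attention matrix grows by 16x.
```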
Follow-Up Work
Deformable DETR (Zhu et al., ICLR 2021) directly addresses the convergence and small-object problems. It replaces standard attention with deformable attention, where each query attends to only a small set of learned sampling points rather than all spatial locations. This reduces complexity from O((H'W')²) to O(H'W' · K) where K is the number of sampling points (typically 4). Combined with multi-scale features, Deformable DETR converges in 50 epochs (6x faster) and improves small object AP substantially.
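A toy, nearest-cell version of the sampling idea. In Deformable DETR the offsets and attention weights are linear projections of the query and sampling is bilinear at fractional locations; here both are random stand-ins to keep the sketch short:

```python
import numpy as np

rng = np.random.default_rng(0)
Hp, Wp, d, K = 25, 33, 256, 4           # feature map size, width, K points

feat = rng.standard_normal((Hp, Wp, d))

# Per query: a reference point plus K offsets and K attention weights.
ref_y, ref_x = 12, 17
offsets = rng.integers(-3, 4, size=(K, 2))
weights = rng.dirichlet(np.ones(K))      # softmax-normalized, sums to 1

sampled = np.stack([
    feat[np.clip(ref_y + dy, 0, Hp - 1), np.clip(ref_x + dx, 0, Wp - 1)]
    for dy, dx in offsets
])                                       # (K, d)
out = weights @ sampled                  # (d,): K lookups, not H'W' scores
```

Each query touches K feature vectors instead of scoring all H'W' positions, which is where the O(H'W' · K) complexity comes from.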
DAB-DETR (Liu et al., ICLR 2022) reinterprets object queries as dynamic anchor boxes — 4D coordinates (cx, cy, w, h) that are iteratively refined across decoder layers. This provides an explicit spatial prior, accelerating convergence and improving interpretability.
DINO (Zhang et al., ICLR 2023) combines deformable attention, contrastive denoising training, and a mixed query selection strategy to push the DETR paradigm to 63.2 AP on COCO with a Swin-L backbone — surpassing all previous detectors. DINO demonstrated that DETR-style architectures, once their convergence issues were resolved, could definitively outperform classical region-based methods.
RT-DETR (Lv et al., 2023) focuses on real-time deployment, achieving strong accuracy-speed trade-offs without NMS by building an efficient hybrid encoder and uncertainty-aware query selection.
Impact and Legacy
DETR's contribution is primarily conceptual rather than metric-driven. Its COCO numbers merely matched Faster R-CNN, but it demonstrated that object detection could be formulated as a clean set prediction problem without any detection-specific inductive biases (anchors, NMS, proposal generation). This was a paradigm shift.
The practical impact materialized through follow-up work. By 2023, DINO-based detectors held state-of-the-art on COCO, and the DETR formulation had been extended to instance segmentation (Mask DINO), 3D detection (DETR3D), multi-object tracking (TrackFormer, MOTR), and visual grounding (MDETR). The "set prediction with Hungarian matching" paradigm now underlies a significant fraction of modern detection and segmentation systems.
DETR also influenced the broader trend of replacing task-specific architectures with general-purpose transformer-based designs. SAM's prompt-based segmentation, for instance, uses a decoder structure clearly descended from DETR's object query mechanism.
Related Reading
- Faster R-CNN — the dominant two-stage detector that DETR was benchmarked against, with region proposals, anchor boxes, and NMS
- YOLO — single-stage detection with grid-based prediction, a different approach to eliminating the proposal stage
- Attention Is All You Need — the transformer architecture that DETR adapts from NLP to vision
- Vision Transformer — ViT's approach to applying transformers to images, which shares DETR's insight that self-attention can replace spatial inductive biases
- Segment Anything (SAM) — prompt-based segmentation using a decoder architecture descended from DETR's object query mechanism
