Modern Object Detection: DETR and Transformer-Based Approaches

Understanding end-to-end object detection with transformers, from DETR's object queries to bipartite matching and attention-based localization


Overview

Modern object detection has evolved from complex multi-stage pipelines (R-CNN family) and anchor-based single-shot detectors (YOLO, SSD) to elegant transformer-based architectures. DETR (DEtection TRansformer) pioneered this shift by treating object detection as a direct set prediction problem, eliminating hand-designed components like anchor boxes, non-maximum suppression (NMS), and region proposal networks.

The key innovation is using learned object queries that attend to image features via cross-attention, enabling the model to reason globally about all objects simultaneously. Combined with bipartite matching during training, DETR achieves end-to-end detection in a single forward pass with no post-processing.

Key Concepts

Object Queries

100 learned embeddings that specialize to detect different objects, positions, and scales during training

Set-Based Prediction

Predict all objects simultaneously without sequential decoding or duplicate suppression

Bipartite Matching

Hungarian algorithm optimally assigns predictions to ground truth during training

Cross-Attention

Object queries attend to encoder features to gather visual information for detection

End-to-End Training

No anchor boxes, NMS, or region proposals - just image in, detections out

Global Reasoning

Transformer self-attention enables reasoning about all objects and their relationships

DETR Architecture Walkthrough


Input Image: H × W × 3 (e.g., 800 × 1066 × 3)

Input Processing

  • RGB image of arbitrary size
  • Resized with max dimension 1333px
  • Normalized with ImageNet mean/std
  • Batched with padding if needed

# No region proposals needed!
# No anchor boxes!
# Just raw image pixels
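A minimal sketch of this preprocessing (DETR's actual transform also rescales the shorter side to around 800px and pads images within a batch; the function name here is illustrative):

import torchvision.transforms.functional as TF

def preprocess(img, max_size=1333):
    # img: float tensor (3, H, W) in [0, 1]
    h, w = img.shape[-2:]
    scale = min(max_size / max(h, w), 1.0)   # cap the longer side at max_size
    img = TF.resize(img, [int(h * scale), int(w * scale)])
    # Normalize with ImageNet mean/std
    return TF.normalize(img, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])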

DETR's Key Innovation

By treating object detection as a set prediction problem with bipartite matching, DETR eliminates hand-designed components like anchor boxes, NMS, and region proposals — achieving a truly end-to-end detection pipeline.

How It Works

1. CNN Feature Extraction

ResNet-50 backbone extracts visual features, reducing spatial dimensions by 32x

features = backbone(image)  # (B, 2048, H/32, W/32)

2. Positional Encoding

Add 2D sine/cosine positional encoding to preserve spatial information

features = features + positional_encoding(H, W)

3. Transformer Encoder

6 layers of self-attention enable global context aggregation across all positions

memory = encoder(features.flatten(2).permute(0, 2, 1))  # global context over all positions

4. Object Query Decoding

100 learned queries attend to encoder output via cross-attention

output = decoder(object_queries, memory)  # (100, 256)

5. Prediction Heads

FFN heads predict class labels and normalized bounding boxes

classes = class_head(output)  # (100, num_classes+1)
bboxes = box_head(output)  # (100, 4)

6. Set Prediction Loss

Hungarian matching assigns predictions to ground truth, then classification and box losses are computed on the matched pairs

matched_indices = hungarian_match(preds, targets)
loss = class_loss + bbox_loss

Object Query Attention Visualization

[Interactive visualization: heatmaps of object query cross-attention over image features, adjustable per decoder layer, colored from low to high attention]
Attention Evolution Through Layers

Early layers (1-2): Broad, diffuse attention - queries explore the entire image.
Middle layers (3-4): Attention starts focusing on relevant regions.
Late layers (5-6): Sharp, localized attention on specific objects.

  • Object Queries: 100 learned embeddings that specialize to detect different objects
  • Cross-Attention: Queries attend to encoder features to gather visual information
  • Parallel Decoding: All queries process simultaneously - no sequential decoding needed

Understanding Object Queries

Object queries are the heart of DETR's innovation. Unlike anchor boxes that are fixed spatial priors, object queries are learned embeddings that develop specializations during training:

  • Some queries learn to detect objects in specific image regions
  • Others specialize for particular object scales or aspect ratios
  • The queries communicate via self-attention to avoid duplicate detections

Each query independently attends to the encoder features and produces one prediction. Most queries predict "no object" for images with few objects.
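As a toy illustration of this parallel decoding (shapes follow the implementation later in this article; the 850-token memory stands in for a 25 × 34 encoder output):

import torch
import torch.nn as nn

queries = nn.Embedding(100, 256)  # one learned embedding per prediction slot
layer = nn.TransformerDecoderLayer(d_model=256, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=6)

memory = torch.randn(1, 850, 256)                   # encoder output (B, H*W, C)
out = decoder(queries.weight.unsqueeze(0), memory)  # (1, 100, 256): one row per query
# Self-attention inside each decoder layer lets the 100 queries coordinate,
# which is how duplicate detections are suppressed without NMS.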

Bipartite Matching for Training

How the Hungarian algorithm assigns predictions to ground truth during training: in the example below, DETR outputs N=100 predictions (8 shown for clarity) to be matched against M=3 ground-truth objects.

Predictions (N=100, showing 8):

  • Q0: person (conf: 95%)
  • Q1: dog (conf: 87%)
  • Q2: car (conf: 91%)
  • Q3: person (conf: 23%)
  • Q4: cat (conf: 15%)
  • Q5: bird (conf: 8%)
  • Q6: no object (conf: 92%)
  • Q7: no object (conf: 88%)

Ground Truth (M=3):

  • G0: person, bbox: [0.14, 0.18, 0.36, 0.85]
  • G1: dog, bbox: [0.54, 0.50, 0.76, 0.90]
  • G2: car, bbox: [0.60, 0.10, 0.93, 0.44]

Key Insight: Hungarian matching ensures each ground truth is assigned to exactly one prediction, enabling set-based training without NMS or anchor boxes.

Bipartite Matching Loss

Traditional detectors assign multiple predictions to each ground truth (via IoU thresholds), then suppress duplicates with NMS. DETR takes a fundamentally different approach:

  1. Cost Matrix: Compute pairwise costs between all predictions and ground truth objects
  2. Hungarian Algorithm: Find optimal 1-to-1 assignment minimizing total cost
  3. Loss Computation: Only matched pairs contribute to the detection loss

The cost combines classification probability and bounding box distance (GIoU + L1):

L_match(y_i, ŷ_σ(i)) = -1_{c_i≠∅} p̂_σ(i)(c_i) + 1_{c_i≠∅} L_box(b_i, b̂_σ(i))
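A minimal sketch of this matching for one image, using scipy's linear_sum_assignment as the Hungarian solver (the GIoU term is omitted for brevity, and the function name and signature are illustrative):

import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes, lambda_l1=5.0):
    # pred_logits: (N, num_classes+1), pred_boxes: (N, 4) normalized
    # gt_labels: (M,) long, gt_boxes: (M, 4)
    probs = pred_logits.softmax(-1)
    cost_class = -probs[:, gt_labels]                    # (N, M): -p̂_j(c_i)
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)   # (N, M): pairwise L1 distance
    cost = cost_class + lambda_l1 * cost_bbox            # full DETR adds a GIoU term too
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return torch.as_tensor(pred_idx), torch.as_tensor(gt_idx)

Because the cost matrix is N × M with N ≥ M, every ground-truth object receives exactly one prediction; the remaining queries are supervised toward "no object".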

Real-World Applications

Autonomous Driving

End-to-end detection of vehicles, pedestrians, and obstacles without complex post-processing

Tesla's vision-only approach uses transformer-based detection

Medical Imaging

Detect tumors, lesions, and anatomical structures in X-rays, CT, and MRI scans

Global context helps reason about rare pathologies

Satellite & Aerial Imagery

Detect buildings, vehicles, and infrastructure across large-scale images

Handles varying object densities without anchor tuning

Robotics & Manipulation

Real-time object detection for grasping and navigation

Deformable DETR achieves real-time performance

Video Understanding

Track objects across frames using query propagation

TrackFormer extends DETR for multi-object tracking

Document Analysis

Detect text regions, tables, and figures in documents

Layout analysis benefits from global context

Simplified DETR Implementation

import torch
import torch.nn as nn
from torchvision.models import resnet50

class SimpleDETR(nn.Module):
    def __init__(self, num_classes=91, num_queries=100, hidden_dim=256):
        super().__init__()
        # CNN backbone (ResNet-50 with the pooling and FC head removed)
        backbone = resnet50(weights="IMAGENET1K_V1")
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.conv = nn.Conv2d(2048, hidden_dim, 1)  # reduce channels to hidden_dim

        # Transformer
        self.transformer = nn.Transformer(
            d_model=hidden_dim, nhead=8,
            num_encoder_layers=6, num_decoder_layers=6,
            dim_feedforward=2048, batch_first=True
        )

        # Object queries (learned embeddings)
        self.query_embed = nn.Embedding(num_queries, hidden_dim)

        # Prediction heads
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)  # +1 for "no object"
        self.bbox_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 4), nn.Sigmoid()  # normalized (cx, cy, w, h)
        )

        # Positional encoding
        self.pos_encoder = PositionalEncoding2D(hidden_dim)

    def forward(self, x):
        # Extract features: (B, 3, H, W) -> (B, 2048, H/32, W/32)
        features = self.backbone(x)
        features = self.conv(features)  # (B, 256, H/32, W/32)
        B, C, H, W = features.shape

        # Add positional encoding
        pos = self.pos_encoder(H, W).to(x.device)
        features = features + pos

        # Flatten spatial dimensions: (B, 256, H, W) -> (B, H*W, 256)
        src = features.flatten(2).permute(0, 2, 1)

        # Object queries: (num_queries, 256) -> (B, num_queries, 256)
        queries = self.query_embed.weight.unsqueeze(0).repeat(B, 1, 1)

        # Transformer: encode image, decode with queries
        output = self.transformer(src, queries)  # (B, num_queries, 256)

        # Predictions
        class_logits = self.class_head(output)  # (B, 100, num_classes+1)
        bbox_pred = self.bbox_head(output)      # (B, 100, 4)
        return {'pred_logits': class_logits, 'pred_boxes': bbox_pred}

class PositionalEncoding2D(nn.Module):
    def __init__(self, hidden_dim, temperature=10000):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.temperature = temperature

    def forward(self, h, w):
        # Half the channels encode y, half encode x
        num_feats = self.hidden_dim // 2
        y_embed = torch.arange(h, dtype=torch.float32).unsqueeze(1).repeat(1, w)
        x_embed = torch.arange(w, dtype=torch.float32).unsqueeze(0).repeat(h, 1)
        dim_t = torch.arange(num_feats // 2, dtype=torch.float32)
        dim_t = self.temperature ** (2 * dim_t / num_feats)
        pos_x = x_embed.unsqueeze(-1) / dim_t  # (h, w, num_feats/2)
        pos_y = y_embed.unsqueeze(-1) / dim_t
        pos_x = torch.stack([pos_x.sin(), pos_x.cos()], dim=-1).flatten(-2)
        pos_y = torch.stack([pos_y.sin(), pos_y.cos()], dim=-1).flatten(-2)
        pos = torch.cat([pos_y, pos_x], dim=-1)   # (h, w, hidden_dim)
        return pos.permute(2, 0, 1).unsqueeze(0)  # (1, hidden_dim, h, w)

# Usage
model = SimpleDETR(num_classes=91)
image = torch.randn(2, 3, 800, 1066)
output = model(image)
print(f"Class predictions: {output['pred_logits'].shape}")  # (2, 100, 92)
print(f"Box predictions: {output['pred_boxes'].shape}")     # (2, 100, 4)
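To turn these raw outputs into final detections, it suffices to threshold on class confidence; a sketch, assuming (as in the model above) that the last logit is the "no object" class, with an illustrative 0.7 threshold:

probs = output['pred_logits'].softmax(-1)  # (2, 100, 92)
scores, labels = probs[..., :-1].max(-1)   # best real class per query
keep = scores > 0.7                        # confidence filter; no NMS needed
boxes = output['pred_boxes'][keep]         # normalized (cx, cy, w, h) boxes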

Advantages & Limitations

Advantages

  • End-to-end training without hand-crafted components (anchors, NMS)
  • Global reasoning via self-attention handles object relationships
  • Excellent performance on large objects due to global context
  • Simpler pipeline with fewer hyperparameters to tune
  • Easily extended to panoptic segmentation and tracking
  • Set-based prediction naturally handles varying object counts

Limitations

  • Slow training convergence (500 epochs vs 36 for Faster R-CNN)
  • Struggles with small objects due to coarse feature resolution
  • High memory usage from full attention over image features
  • Fixed number of queries limits maximum detections per image
  • Requires careful learning rate scheduling and data augmentation
  • Inference slower than optimized CNN-based detectors

Evolution: Deformable DETR and Beyond

The original DETR's limitations sparked rapid innovation:

Model            | Key Improvement               | Training Epochs | Small-Object AP
DETR             | Original                      | 500             | 22.5
Deformable DETR  | Sparse attention, multi-scale | 50              | 28.8
Conditional DETR | Conditional cross-attention   | 50              | -
DAB-DETR         | Anchor-based queries          | 50              | -
DINO             | Denoising training            | 12              | 32.1
RT-DETR          | Real-time focus               | 72              | 35.1

Deformable DETR is particularly impactful: it replaces full attention with deformable attention, in which each query attends to only a small set of K learned sampling points (typically K=4) instead of all positions, reducing the attention cost from O(N²) to roughly O(N) in the number of image tokens.
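A single-scale, single-head sketch of the idea (actual Deformable DETR uses multi-scale feature maps, multiple heads, and learned reference points; the class and the 0.1 offset scale here are illustrative simplifications):

import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAttentionSketch(nn.Module):
    def __init__(self, dim=256, num_points=4):
        super().__init__()
        self.num_points = num_points
        self.offset_head = nn.Linear(dim, num_points * 2)  # (dx, dy) per sampling point
        self.weight_head = nn.Linear(dim, num_points)      # one weight per sampling point
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, feat):
        # queries: (B, Q, C); ref_points: (B, Q, 2) as (x, y) in [0, 1]; feat: (B, C, H, W)
        B, Q, _ = queries.shape
        offsets = self.offset_head(queries).view(B, Q, self.num_points, 2)
        weights = self.weight_head(queries).softmax(dim=-1)       # (B, Q, K)
        # Sampling locations: small bounded offsets around each reference point
        locs = ref_points.unsqueeze(2) + 0.1 * offsets.tanh()     # (B, Q, K, 2)
        grid = locs * 2.0 - 1.0                                   # to [-1, 1] for grid_sample
        sampled = F.grid_sample(feat, grid, align_corners=False)  # (B, C, Q, K)
        sampled = sampled.permute(0, 2, 3, 1)                     # (B, Q, K, C)
        out = (weights.unsqueeze(-1) * sampled).sum(dim=2)        # (B, Q, C)
        return self.out_proj(out)

Each query gathers information from just num_points sampled locations, so the cost scales with N·K rather than N².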

Best Practices

  • Use Deformable Attention: For practical applications, use Deformable DETR or its variants for faster training and better small object detection
  • Multi-Scale Features: Use FPN-style multi-scale features to capture objects at different sizes
  • Auxiliary Losses: Add prediction heads after each decoder layer for faster convergence
  • Strong Augmentation: Use aggressive data augmentation (random crop, scale jitter, color jitter) to compensate for slow convergence
  • Careful Learning Rate: Use a lower learning rate for the backbone (1e-5) and a higher one for the transformer (1e-4); see the sketch after this list
  • Query Design: Consider anchor-based queries (DAB-DETR) or denoising training (DINO) for faster convergence
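As a concrete sketch of the learning-rate split, using the SimpleDETR model from the implementation above (AdamW and weight decay 1e-4 follow the original DETR recipe):

import torch

# Separate parameter groups: the pretrained backbone gets a lower learning rate
param_groups = [
    {"params": [p for n, p in model.named_parameters() if "backbone" in n], "lr": 1e-5},
    {"params": [p for n, p in model.named_parameters() if "backbone" not in n], "lr": 1e-4},
]
optimizer = torch.optim.AdamW(param_groups, weight_decay=1e-4)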

Mathematical Foundation

The detection loss combines classification and localization:

L(y, ŷ) = Σ_i [ λ_cls · L_cls(c_i, ĉ_σ(i)) + 1_{c_i≠∅} · (λ_L1 · ||b_i - b̂_σ(i)||_1 + λ_giou · L_giou(b_i, b̂_σ(i))) ]

Where:

  • σ is the optimal assignment from Hungarian matching
  • L_cls is the classification loss: cross-entropy in the original DETR, focal loss in later variants such as Deformable DETR (to handle class imbalance)
  • L_giou is generalized IoU loss for better box regression
  • λ_L1=5 and λ_giou=2 are the loss weights from the DETR paper; λ_cls varies by variant
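A minimal sketch of this loss for one image, given the matched indices from the Hungarian step (cross-entropy stands in for the classification term, and DETR's down-weighting of the "no object" class is omitted):

import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss

def box_cxcywh_to_xyxy(b):
    # (cx, cy, w, h) -> (x1, y1, x2, y2), as required by the GIoU loss
    cx, cy, w, h = b.unbind(-1)
    return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)

def detection_loss(pred_logits, pred_boxes, gt_labels, gt_boxes,
                   pred_idx, gt_idx, num_classes,
                   lambda_l1=5.0, lambda_giou=2.0):
    # Every query gets a class target; unmatched queries -> "no object" (index num_classes)
    target_classes = torch.full((pred_logits.shape[0],), num_classes, dtype=torch.long)
    target_classes[pred_idx] = gt_labels[gt_idx]
    loss_cls = F.cross_entropy(pred_logits, target_classes)
    # Box losses apply only to matched pairs
    matched, targets = pred_boxes[pred_idx], gt_boxes[gt_idx]
    loss_l1 = F.l1_loss(matched, targets)
    loss_giou = generalized_box_iou_loss(
        box_cxcywh_to_xyxy(matched), box_cxcywh_to_xyxy(targets), reduction="mean")
    return loss_cls + lambda_l1 * loss_l1 + lambda_giou * loss_giou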

The Hungarian matching cost:

C_match(y_i, ŷ_j) = -p̂_j(c_i) + λ_L1 · ||b_i - b̂_j||_1 + λ_giou · L_giou(b_i, b̂_j)
