TL;DR
Object detection has historically relied on hierarchical backbones (ResNet, Swin Transformer) that produce multi-scale feature pyramids. This paper asks a simple question: can a plain, non-hierarchical ViT — with single-scale features and no built-in multi-resolution structure — work as a competitive object detection backbone? The answer is yes, with minimal modifications. By using simple feature pyramid construction from intermediate ViT layers and window attention during fine-tuning, a plain ViT-Huge backbone achieves 60.4 AP on COCO, matching or exceeding hierarchical alternatives like Swin-L and MViTv2-L while being architecturally simpler.
The Core Challenge: Single-Scale vs. Multi-Scale
The dominant paradigm in object detection before this paper was to use hierarchical backbones that naturally produce multi-scale feature pyramids. CNN-based detectors rely on feature pyramid networks (FPN) that extract features at multiple spatial resolutions. A ResNet, for instance, naturally produces feature maps at strides of 4, 8, 16, and 32 pixels through its successive pooling and strided convolution layers. Hierarchical vision transformers like Swin and MViTv2 were designed specifically to replicate this multi-scale structure, introducing progressively reduced spatial resolution at each stage.
These multi-scale features are critical for detecting objects across a wide range of sizes — small objects are best detected from high-resolution, low-stride feature maps, while large objects benefit from semantically richer, low-resolution maps.
A plain ViT has none of this structure. It processes an image as a flat sequence of non-overlapping patches (typically 16 × 16 pixels), applies L transformer blocks of identical dimension, and produces a single-scale feature map at stride 16. There is no downsampling, no resolution hierarchy, and no natural place to tap multi-scale features.
The computational structure of these two approaches differs fundamentally. In a hierarchical backbone, the feature map spatial dimensions shrink at each stage while channel dimensions grow, maintaining roughly constant FLOPs per stage. The total computation distributes across resolution levels:
$$\text{FLOPs}_{\text{hier}} \propto \sum_{s} \frac{HW}{r_s^2} \cdot C_s^2$$

where C_s is the channel dimension and r_s is the spatial reduction factor at stage s. In a plain ViT, all L layers operate at the same resolution with the same hidden dimension d, making the cost uniform across depth:

$$\text{FLOPs}_{\text{plain}} \propto L \cdot N \cdot d^2$$

where N = HW/p² is the number of patches for patch size p. The question is whether this architectural simplicity is a fundamental limitation or merely a matter of adaptation.
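The two cost models can be sketched numerically. The hierarchical stage configuration below is illustrative (loosely Swin-like), not taken from the paper, and only the dominant projection FLOPs are counted (the quadratic attention term is omitted):

```python
# Sketch of the two backbone cost models above. The hierarchical stage
# configuration is illustrative, not from the paper; attention's N^2 term
# is omitted so only the dominant projection cost is counted.

def hierarchical_flops(H, W, stages):
    """stages: list of (num_blocks, channels C_s, reduction r_s)."""
    total = 0
    for depth, C, r in stages:
        tokens = (H // r) * (W // r)   # HW / r_s^2
        total += depth * tokens * C ** 2
    return total

def plain_vit_flops(H, W, patch, d, L):
    """All L blocks run at stride `patch` with hidden dimension d."""
    N = (H // patch) * (W // patch)    # N = HW / p^2
    return L * N * d ** 2

# Doubling channels while doubling the reduction keeps per-block cost constant,
# which is exactly the "roughly constant FLOPs per stage" property:
print(hierarchical_flops(1024, 1024, [(1, 96, 4)]))
print(hierarchical_flops(1024, 1024, [(1, 192, 8)]))
print(plain_vit_flops(1024, 1024, 16, 768, 12))   # ViT-B-like plain backbone
```

Note how the first two calls return the same number: halving resolution while doubling channels leaves per-block cost unchanged, which is what lets hierarchical backbones spend comparable compute at every scale.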
Method: Minimal Adaptations to Plain ViT
The paper proposes a set of lightweight modifications that adapt ViT for detection without changing its core architecture. The backbone remains a plain, non-hierarchical transformer; only the interface between backbone and detector head is modified.
Simple Feature Pyramid from ViT Layers. Instead of building a feature pyramid from different spatial resolutions (as in FPN), the authors construct it from features at different depths of the ViT. They select feature maps from evenly spaced transformer blocks (e.g., blocks 6, 12, 18, 24 from a ViT with L = 24 layers) and apply lightweight upsampling and downsampling operations to create feature maps at strides 4, 8, 16, and 32:
- Stride 4: two successive 2× deconvolutions applied to features from the shallowest selected layer
- Stride 8: one 2× deconvolution
- Stride 16: identity (the native ViT resolution)
- Stride 32: one 2× strided convolution applied to features from the deepest layer
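The stride arithmetic above can be sketched at the shape level. This is not the ViTDet code: nearest-neighbor repeat and strided slicing stand in for the learned 2× deconvolutions and strided convolutions, and a single stride-16 map stands in for the per-layer taps:

```python
import numpy as np

# Shape-level sketch of the simple feature pyramid. Nearest-neighbor repeat
# and strided slicing are stand-ins for the learned 2x deconvolutions and
# strided convolutions; one stride-16 map stands in for the per-layer taps.

def upsample2x(x):
    # (C, H, W) -> (C, 2H, 2W)
    return x.repeat(2, axis=1).repeat(2, axis=2)

def downsample2x(x):
    # (C, H, W) -> (C, H/2, W/2)
    return x[:, ::2, ::2]

# A 1024 x 1024 image with 16 x 16 patches gives a 64 x 64 stride-16 map.
feat = np.zeros((768, 64, 64))

pyramid = {
    4:  upsample2x(upsample2x(feat)),  # two 2x upsamples   -> stride 4
    8:  upsample2x(feat),              # one 2x upsample    -> stride 8
    16: feat,                          # identity           -> stride 16
    32: downsample2x(feat),            # one 2x downsample  -> stride 32
}
for stride, f in pyramid.items():
    assert f.shape[1] == 1024 // stride  # spatial size = image size / stride
```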
The key insight: features from early layers tend to capture lower-level information (edges, textures) while deeper layers capture higher-level semantics, providing a form of multi-scale representation even without multi-resolution spatial structure.
Window Attention for High-Resolution Inputs. Standard global self-attention has quadratic cost in the number of tokens. For an image of resolution 1024 × 1024 with patch size 16, there are N = 4096 tokens, making the attention matrix N × N ≈ 16.8M entries — a substantial memory and compute burden. The paper uses windowed self-attention during fine-tuning, where attention is computed within local 14 × 14 windows. This reduces the cost from O(N²) to O(N · w²), where w = 14 is the window side length (w² = 196 tokens per window).
The critical design choice is how to maintain cross-window information flow. Without it, each window operates in isolation and the model cannot reason about objects that span multiple windows. The authors address this by designating a small number of global attention blocks (typically 4 evenly spaced blocks out of 24 total) that compute full N × N attention. These global blocks act as information bridges — features from different spatial regions can interact at these layers, while the remaining windowed layers handle local processing efficiently. This is a simpler approach than Swin Transformer’s shifted window mechanism, which achieves cross-window communication through alternating window configurations at every layer.
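The partition-plus-schedule described above can be sketched at the shape level; the helper and the block-index arithmetic here are illustrative, not taken from the ViTDet implementation:

```python
import numpy as np

# Window partitioning plus the windowed/global block schedule described
# above. Names are illustrative, not from the ViTDet code.

def window_partition(x, w):
    # (H, W, d) -> (num_windows, w*w, d); assumes H and W divisible by w
    H, W, d = x.shape
    x = x.reshape(H // w, w, W // w, w, d)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, w * w, d)

tokens = np.zeros((56, 56, 768))        # e.g. an 896 x 896 input at stride 16
windows = window_partition(tokens, 14)  # 16 windows of 196 tokens each

# 24 blocks with global attention at 4 evenly spaced positions; every other
# block attends only within its own 14 x 14 window.
depth, num_global = 24, 4
global_blocks = [(i + 1) * depth // num_global - 1 for i in range(num_global)]
print(windows.shape, global_blocks)  # (16, 196, 768) [5, 11, 17, 23]
```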
Pre-training Compatibility. A practical advantage of using a plain ViT is that it can leverage existing pre-trained checkpoints from MAE (Masked Autoencoder) self-supervised learning. The backbone weights transfer directly because the architecture is unchanged — only the detection head and feature pyramid are added during fine-tuning. This is a meaningful advantage over hierarchical vision transformers that require pre-training from scratch or architecture-specific self-supervised methods.
Why Multi-Scale From Depth Works
The effectiveness of extracting multi-scale features from different ViT depths (rather than different spatial resolutions) deserves closer examination. In a plain ViT, the self-attention operation at each layer computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where Q, K, and V are the query, key, and value projections of the token sequence and d_k is the key dimension.
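In code, this is standard scaled dot-product attention; a minimal single-head NumPy sketch (illustrative, the actual backbone uses multi-head attention):

```python
import numpy as np

# Minimal single-head scaled dot-product self-attention (illustrative;
# the real backbone uses multi-head attention with per-head projections).

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # (N, N) token affinities
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # each token mixes all others

rng = np.random.default_rng(0)
X = rng.standard_normal((196, 64))                # one 14 x 14 window of tokens
Wq, Wk, Wv = (rng.standard_normal((64, 64)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
```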
At every layer, each token attends to all other tokens within its window (or globally). But the nature of the learned representations changes with depth. Probing studies (Raghu et al. 2021) have shown that early ViT layers learn representations similar to convolutional features — local textures, edges, and color gradients — while deeper layers learn increasingly abstract, semantically rich representations. This depth-dependent feature hierarchy is analogous to the spatial-resolution hierarchy in CNNs, but arises from the progressive abstraction of self-attention rather than from spatial downsampling.
The feature pyramid construction exploits this by tapping layers at evenly spaced intervals. For a ViT-L with 24 layers, the authors extract features from layers 6, 12, 18, and 24. The shallow features (layer 6) retain local spatial information useful for small-object detection, while deep features (layer 24) provide the semantic context needed for large-object recognition and classification. Deconvolution layers upsample shallow features to stride 4, while strided convolutions downsample deep features to stride 32, creating a standard 4-level pyramid compatible with FPN-based detector heads.
This design has an elegant property: the backbone itself is unmodified. All multi-scale processing happens in the lightweight pyramid neck, meaning the same backbone checkpoint can be reused across detection, segmentation, and classification with only the neck and head changing. The pyramid construction adds minimal overhead — the deconvolution and strided convolution layers are small relative to the ViT backbone — and does not require any changes to the pre-training procedure.
Key Results
The experiments use Mask R-CNN and Cascade Mask R-CNN as detection frameworks on COCO.
Plain ViT matches hierarchical backbones. ViT-B with the proposed adaptations achieves 51.6 box AP on COCO using Cascade Mask R-CNN, comparable to Swin-B (51.9 AP) and MViTv2-B (51.7 AP). Scaling to ViT-L reaches 54.0 AP, and with ViT-H the system achieves 60.4 AP — the highest single-model result on COCO at the time of publication.
MAE pre-training is critical. When initialized with MAE pre-training, ViT-L outperforms its supervised ImageNet-pretrained counterpart by approximately 4 AP. This gap is substantially larger than the pre-training improvement seen with hierarchical backbones (where MAE yields roughly 1–2 AP improvement), suggesting that plain ViTs benefit disproportionately from self-supervised pre-training for dense prediction tasks. The likely explanation is that MAE forces the model to learn strong local and spatial representations through the masking-and-reconstruction objective — exactly the kind of spatial reasoning that hierarchical architectures build in through their multi-scale structure but that a plain ViT must learn from data.
Simplicity has computational benefits. The plain ViT backbone, despite operating at a single scale internally, achieves competitive FLOPs and throughput compared to hierarchical alternatives. The regular, non-branching structure of plain ViT is also more hardware-friendly — it maps cleanly to modern accelerators without the irregular memory access patterns introduced by multi-scale processing.
Scaling behavior is strong. The paper demonstrates consistent improvements when scaling from ViT-B (86M parameters) to ViT-L (304M) to ViT-H (632M). The AP improvements are roughly log-linear with model size, and importantly, the gap between plain ViT and hierarchical backbones narrows with scale. At the ViT-H level, the plain backbone is unambiguously better than any hierarchical alternative tested, suggesting that the multi-scale inductive bias becomes less important as model capacity increases.
Window attention configuration matters. The ablation studies show that using 4 global attention blocks (out of 24 total) interspersed with windowed attention blocks provides a good balance. Using fewer global blocks degrades cross-region reasoning, while using more global blocks increases cost without proportionate accuracy gains. The window size of 14 × 14 tokens was chosen to match the pre-training resolution of 224 / 16 = 14 patches per side, allowing positional embedding reuse.
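The cost side of this tradeoff is easy to approximate with back-of-the-envelope arithmetic (these are illustrative counts of attention-matrix entries, not figures from the paper, and padding at window boundaries is ignored):

```python
# Attention-matrix entries per forward pass for a 1024 x 1024 input at
# stride 16 (N = 4096 tokens): all-global vs. 20 windowed + 4 global blocks.
# Back-of-the-envelope arithmetic, not figures from the paper.

N, depth = 64 * 64, 24          # tokens and transformer blocks
w2 = 14 * 14                    # tokens per 14 x 14 window

all_global = depth * N * N       # every block builds an N x N matrix
mixed = 20 * N * w2 + 4 * N * N  # windowed blocks cost roughly N * w^2 each

print(f"all global: {all_global:.2e}, mixed: {mixed:.2e}, "
      f"saving: {all_global / mixed:.1f}x")
```

Most of the remaining cost in the mixed schedule comes from the 4 global blocks, which is why adding more of them raises cost quickly without proportionate accuracy gains.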
Critical Analysis
Strengths:
- The paper makes a strong case for architectural simplicity. Rather than designing increasingly complex hierarchical vision transformers (Swin, PVT, MViTv2, each adding architectural innovations), it shows that a plain ViT with trivial adaptations can match these systems. This is a useful counterpoint to the trend of ever-more-complex architectures.
- The experimental methodology is thorough, with controlled comparisons across backbone sizes (ViT-B, L, H), pre-training strategies (supervised, MAE), and detection frameworks. The ablation studies on window attention configurations and feature pyramid design are well-structured.
- The practical implication is significant: teams can use a single plain ViT checkpoint for both classification and detection, simplifying the model development pipeline. This “one backbone, many heads” approach reduces engineering complexity and the computational cost of maintaining separate pre-trained models for different tasks.
Limitations:
- The windowed attention mechanism and feature pyramid construction, while simple, are not entirely “plain.” The claim of using a plain ViT backbone is slightly overstated — the detection-specific adaptations (deconvolutions for upsampling, windowed attention) add architectural complexity, even if it is lightweight.
- The strong results depend heavily on MAE pre-training. Without it, the plain ViT backbone underperforms hierarchical alternatives by a meaningful margin. This couples the detector’s effectiveness to the quality of self-supervised pre-training.
- The paper does not explore single-stage detectors (FCOS, YOLO-style) or transformer-based detectors (DETR) in depth. It remains an open question whether the conclusions hold beyond the Mask R-CNN family.
- Inference latency analysis is limited. FLOPs and throughput are reported, but actual latency measurements across batch sizes and hardware configurations would strengthen the efficiency claims.
- The comparison with DETR-family detectors is notably absent. DETR and Deformable DETR take a fundamentally different approach to detection (set prediction with bipartite matching rather than region proposals), and understanding how plain ViT backbones interact with these end-to-end detectors would broaden the paper’s conclusions.
- Small-object detection, where multi-scale features are most critical, receives limited dedicated analysis. While the overall AP numbers are strong, a breakdown by object size (AP-small, AP-medium, AP-large) would clarify whether the depth-based feature pyramid is truly equivalent to a spatial-resolution-based one for small objects.
Impact and Legacy
This paper, known as ViTDet, became the standard approach for using plain ViT backbones in detection and influenced the broader trend toward simpler, non-hierarchical architectures for dense prediction tasks. It demonstrated that the complexity of hierarchical vision transformers (Swin’s shifted windows, MViTv2’s pooled attention) may be unnecessary when strong pre-training is available.
The work directly informed Meta’s Segment Anything Model (SAM), which uses a ViTDet-style backbone for its image encoder. SAM’s success in promptable segmentation provided further evidence that plain ViT backbones are sufficient for dense visual tasks when properly pre-trained. The ViTDet approach also appears in DINOv2 and other foundation model pipelines where a single pre-trained backbone is adapted to multiple downstream tasks.
The paper also influenced how the community thinks about the relationship between pre-training and architecture. The finding that MAE pre-training narrows the gap between plain and hierarchical ViTs suggests a general principle: strong self-supervised pre-training can compensate for missing architectural inductive biases. This has practical implications for the model development lifecycle — rather than engineering task-specific architectures, teams can invest in pre-training a single plain ViT and adapt it to diverse downstream tasks (classification, detection, segmentation, depth estimation) through lightweight task-specific heads.
The ViTDet recipe (plain ViT backbone + windowed attention + simple feature pyramid) has been adopted beyond COCO object detection. It has been applied to video understanding, medical image analysis (where hierarchical features were previously considered essential for multi-scale pathology detection), and 3D point cloud processing. In each case, the core lesson holds: a plain transformer with minimal task-specific adaptation can match or exceed purpose-built architectures when pre-training is sufficiently strong.
The paper also contributed to the emerging consensus that foundation models benefit from architectural simplicity. A plain ViT backbone can be pre-trained once (with MAE, DINO, or similar self-supervised methods) and then deployed across classification, detection, segmentation, and generation tasks with only lightweight adapters. This amortizes the substantial pre-training cost across many downstream applications, an approach that becomes increasingly attractive as pre-training datasets and compute budgets grow.
More broadly, the paper contributed to a reassessment of architectural inductive biases in computer vision. The results suggest that with sufficient pre-training data and appropriate self-supervised objectives, learned representations can substitute for architectural priors like multi-scale feature hierarchies — echoing a recurring theme in deep learning where scale and data diminish the value of hand-designed structure.
Related Reading
- Vision Transformer (ViT) — the plain ViT architecture that serves as the backbone in this work
- Attention Is All You Need — the original transformer, whose self-attention mechanism underpins ViT
- DINO — self-supervised ViT pre-training that provides complementary pre-training strategies for dense prediction
- Data Movement Is All You Need — analysis of the memory-bound operations that dominate transformer inference, relevant to ViTDet’s efficiency considerations
