
DINOv2: Learning Robust Visual Features without Supervision

How DINOv2 combines DINO self-distillation with iBOT masked prediction at scale on curated data (LVD-142M), producing the strongest open-source frozen visual features across classification, segmentation, depth, and retrieval.

Maxime Oquab, Timothée Darcet, and 24 co-authors · 15 min read · Original Paper · Tags: self-supervised-learning, foundation-model, knowledge-distillation, +1

Paper Overview

Self-supervised learning has produced increasingly powerful visual features, but until DINOv2, no single method could match task-specific supervised models across the full range of vision tasks — classification, segmentation, depth estimation, retrieval, and video understanding — using frozen features alone. Self-distillation methods like DINO produce excellent classification features but weaker dense prediction capabilities. Masked image modeling methods like MAE learn rich spatial representations but require fine-tuning to become competitive on classification. DINOv2 unifies these complementary strengths by combining DINO’s self-distillation objective with iBOT’s masked patch prediction, training at scale on carefully curated data, and distilling the resulting knowledge into efficient models.

Published in TMLR 2024 by a large team at Meta AI led by Maxime Oquab, DINOv2 makes three key engineering contributions beyond the algorithmic combination. First, a data curation pipeline that builds LVD-142M — a 142-million image dataset retrieved from web-crawled sources using curated seed images, then deduplicated to ensure diversity. Second, training a massive ViT-g/14 model (1.1 billion parameters) with stabilization techniques including Sinkhorn-Knopp centering and KoLeo regularization. Third, distilling the ViT-g teacher into efficient ViT-S, ViT-B, and ViT-L students that retain most of the teacher’s performance at a fraction of the compute cost. The result is a family of visual backbones whose frozen features achieve 86.5% linear probe accuracy on ImageNet-1K (ViT-g) and set new state-of-the-art results across 12+ benchmarks without any task-specific fine-tuning.

DINOv2’s core claim is that self-supervised learning can produce visual features that are truly general-purpose — features that work as well for pixel-level segmentation as for image-level classification, without any adaptation. This is a qualitative shift from prior methods that excelled at one task family but required fine-tuning or architectural modification for others. The combination of strong algorithmic design, large-scale curated data, and careful engineering establishes DINOv2 as the de facto standard for frozen visual features in the research community.

Combined Training Objective

DINOv2’s training objective combines two complementary self-supervised signals. The first is the DINO loss, which operates at the image level: a student network processes multiple crops (2 global + 8 local) of an image, while a momentum-updated teacher network processes only the global crops. The student’s [CLS] token representations are trained to match the teacher’s [CLS] representations through a cross-entropy loss over Sinkhorn-normalized soft targets. This image-level objective encourages the model to learn holistic semantic representations that are invariant to crop position and scale.
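The multi-crop scheme above can be sketched as follows. This is a minimal illustration, not the paper's augmentation code: the scale ranges (0.32–1.0 for global, 0.05–0.32 for local) follow common DINO-style defaults and are assumptions here, and the resize to network resolution is elided.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(img, scale_range):
    # Sample a crop whose area fraction of the image lies in scale_range.
    # (The resize to the network input size, e.g. 224 or 96 px, is elided.)
    H, W, _ = img.shape
    frac = rng.uniform(*scale_range)
    ch = max(1, int(H * frac ** 0.5))
    cw = max(1, int(W * frac ** 0.5))
    y = rng.integers(0, H - ch + 1)
    x = rng.integers(0, W - cw + 1)
    return img[y:y + ch, x:x + cw]

def multi_crop(img):
    # DINO-style augmentation: 2 global crops (large scale) go to both the
    # teacher and the student; 8 local crops (small scale) go to the student only.
    global_crops = [random_crop(img, (0.32, 1.0)) for _ in range(2)]
    local_crops = [random_crop(img, (0.05, 0.32)) for _ in range(8)]
    return global_crops, local_crops
```

Because the teacher only ever sees global views while the student must match it from small local views, the student is pushed toward crop- and scale-invariant representations.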

The second component is the iBOT loss, which operates at the patch level. Within each global crop processed by the student, a random subset of patches is masked. The student produces representations for these masked positions using [MASK] tokens, and these representations are trained to match the teacher’s representations for the same patch positions (computed from the unmasked image). The combined loss is:

$$\mathcal{L}_{\text{DINOv2}} = \mathcal{L}_{\text{DINO}}^{\text{cls}} + \lambda \, \mathcal{L}_{\text{iBOT}}^{\text{patch}}$$

where λ balances the two objectives. The DINO component drives global semantic understanding — recognizing that a crop of a dog’s face and a crop of its body belong to the same image. The iBOT component drives local spatial understanding — learning what visual content occupies specific spatial positions. Together, they produce features that are simultaneously strong for image-level tasks (classification, retrieval) and dense prediction tasks (segmentation, depth estimation) without fine-tuning.
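The combined objective can be sketched numerically. This is an illustrative sketch, not the released training code: the temperatures (a sharper teacher at 0.04, a softer student at 0.1) follow DINO-style conventions, and `lam` defaulting to 1.0 is an assumption.

```python
import numpy as np

def softmax(x, tau):
    z = x / tau
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(teacher_logits, student_logits, tau_t=0.04, tau_s=0.1):
    # H(teacher, student): sharp teacher targets, softer student predictions.
    t = softmax(teacher_logits, tau_t)
    log_s = np.log(softmax(student_logits, tau_s) + 1e-9)
    return -(t * log_s).sum(axis=-1).mean()

def dinov2_loss(t_cls, s_cls, t_patch, s_patch, mask, lam=1.0):
    # Image-level DINO term on the [CLS] outputs, plus the patch-level iBOT
    # term computed only at masked positions (boolean mask over patches).
    l_dino = cross_entropy(t_cls, s_cls)
    l_ibot = cross_entropy(t_patch[mask], s_patch[mask])
    return l_dino + lam * l_ibot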

Two stabilization techniques prevent the teacher’s output from collapsing to a trivial uniform or peaked distribution. Sinkhorn-Knopp centering replaces DINO’s simple mean centering with an iterative algorithm that enforces a uniform marginal distribution over the output dimensions, providing stronger anti-collapse guarantees. The KoLeo regularizer encourages uniform coverage of the feature space by penalizing close pairs in the embedding space, preventing the model from clustering all representations into a few modes.
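Both stabilizers are compact enough to sketch. The Sinkhorn-Knopp iteration below follows the structure used in DINO-family code (hyperparameters `eps` and `n_iters` are illustrative), and the KoLeo term is a simplified nearest-neighbor form of the Kozachenko-Leonenko entropy estimator; neither is presented as the exact released implementation.

```python
import numpy as np

def sinkhorn_knopp(teacher_logits, eps=0.05, n_iters=3):
    """Turn teacher logits (B, K) into soft assignments whose prototype
    marginals are uniform, so no subset of outputs can absorb all samples."""
    Q = np.exp(teacher_logits / eps).T       # (K, B)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=1, keepdims=True)    # rows: each prototype gets mass 1/K
        Q /= K
        Q /= Q.sum(axis=0, keepdims=True)    # cols: each sample gets mass 1/B
        Q /= B
    return (Q * B).T                         # (B, K): each row a distribution

def koleo(embeddings, eps=1e-8):
    """Penalize embeddings whose nearest neighbor is too close, pushing the
    batch toward uniform coverage of the (normalized) feature space."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    d = np.linalg.norm(x[:, None] - x[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)              # ignore self-distances
    return -np.mean(np.log(d.min(axis=1) + eps))
```

A clumped batch (all embeddings nearly identical) receives a much larger KoLeo penalty than a spread-out batch, which is exactly the anti-collapse pressure described above.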

Data Curation at Scale

A central thesis of DINOv2 is that curated data matters more than algorithmic novelty. Most self-supervised methods train on ImageNet-1K (1.28M images) or ImageNet-22K (14M images) — datasets that are either too small to fully exploit large models or limited to specific object categories. Web-crawled datasets like LAION-5B are massive but noisy, containing duplicates, watermarked images, irrelevant content, and harmful material. DINOv2 introduces a middle path: a retrieval-based curation pipeline that uses high-quality seed datasets to mine relevant, clean, and diverse images from a large web crawl.

The pipeline works in four stages. First, a web crawl produces approximately 1.2 billion uncurated images. Second, curated seed datasets — ImageNet-22K, Google Landmarks, and several fine-grained classification datasets totaling roughly 25 million images — provide quality anchors. Third, for each seed image, the pipeline retrieves the nearest neighbors from the web crawl using pre-computed embeddings, expanding the curated core with visually similar but novel web images. Fourth, aggressive deduplication (both exact and near-duplicate) removes redundancy, producing the final LVD-142M dataset of 142 million unique, diverse, and quality-filtered images.
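Stages three and four can be sketched with plain cosine similarity. This is a toy illustration of the retrieval-plus-deduplication idea, assuming L2-normalized embeddings; the `k` and `dedup_thresh` values are invented for the example (the real pipeline operates at billion-image scale with approximate nearest-neighbor indices).

```python
import numpy as np

def curate(seed_emb, web_emb, k=4, dedup_thresh=0.98):
    """For each seed embedding, retrieve the k nearest web images by cosine
    similarity, then greedily drop near-duplicates among the retrieved set."""
    sims = seed_emb @ web_emb.T                  # (n_seed, n_web) cosine sims
    nn = np.argsort(-sims, axis=1)[:, :k]        # top-k web neighbors per seed
    candidates = np.unique(nn.ravel())           # union of retrieved indices
    kept = []
    for idx in candidates:                       # greedy near-duplicate removal
        v = web_emb[idx]
        if all(v @ web_emb[j] < dedup_thresh for j in kept):
            kept.append(int(idx))
    return kept
```

The output is a subset of web images that are close to the curated seed distribution yet mutually distinct, which is the property LVD-142M's pipeline is built to guarantee at scale.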

The impact of data curation is substantial. Training DINOv2 on LVD-142M instead of ImageNet-22K improves linear probe accuracy by 2–3 percentage points across multiple benchmarks, with particularly large gains on fine-grained tasks like Flowers-102 and iNaturalist where the curated web images provide domain diversity that ImageNet lacks. This result challenges the common assumption in self-supervised learning that algorithmic innovation is the primary driver of progress — for DINOv2, the data curation pipeline contributed as much or more than any individual architectural or loss function change.

Model Distillation

Training a ViT-g model with 1.1 billion parameters produces the strongest features, but deploying such a large model is impractical for most applications. DINOv2 addresses this through knowledge distillation: the ViT-g teacher’s representations are compressed into ViT-S (22M params), ViT-B (86M params), and ViT-L (304M params) students. Unlike standard distillation that trains a student to match the teacher’s output logits, DINOv2 uses the same self-supervised training procedure — the student learns from the ViT-g teacher using the combined DINO + iBOT objective, treating the frozen ViT-g as the momentum teacher.

The distilled models retain a remarkable fraction of the teacher’s performance. ViT-L preserves 97% of the teacher’s linear probe accuracy at less than one-third the parameters, making it practical for research and many production settings. ViT-B retains 95% at less than one-tenth the parameters, suitable for edge deployment. Even ViT-S, at just 22 million parameters, retains 92% of the teacher’s performance — far stronger than training a ViT-S from scratch with the same self-supervised objective. This distillation cascade means DINOv2’s advances are accessible across the full compute spectrum, from mobile inference to datacenter-scale processing.

Universal Feature Quality

DINOv2’s headline claim is that its frozen features match or exceed task-specific supervised models across a wide range of vision tasks. This is evaluated by attaching simple linear heads (for classification) or lightweight decoder heads (for dense prediction) to frozen DINOv2 features, without updating any backbone parameters. The breadth of this evaluation is what distinguishes DINOv2 from prior work — previous methods typically excelled at one task family but required fine-tuning for others.
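The linear-probe protocol is simple enough to sketch end to end: the backbone is frozen, features are extracted once, and only a linear classifier is trained on top. The gradient-descent softmax regression below is a minimal stand-in for that protocol (learning rate and step count are illustrative, not the paper's evaluation hyperparameters).

```python
import numpy as np

def linear_probe(feats, labels, n_classes, lr=0.5, steps=200):
    """Train only a linear head on frozen features via softmax regression.
    feats: (N, D) backbone outputs; labels: (N,) integer class ids."""
    W = np.zeros((feats.shape[1], n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[labels]                 # one-hot targets
    for _ in range(steps):
        logits = feats @ W + b
        logits -= logits.max(axis=1, keepdims=True)
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        g = (p - Y) / len(feats)                  # softmax cross-entropy gradient
        W -= lr * feats.T @ g                     # only the head is updated;
        b -= lr * g.sum(axis=0)                   # the backbone never changes
    return W, b

def predict(W, b, feats):
    return (feats @ W + b).argmax(axis=1)
```

The point of the protocol is that probe accuracy is a direct readout of feature quality: if a linear map suffices, the frozen representation already encodes the task.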

On ImageNet-1K linear probe, DINOv2 ViT-g achieves 86.5%, compared to 83.5% for DINO ViT-L, 75.8% for MAE ViT-L, and 85.2% for OpenCLIP ViT-G (which uses text supervision). On ADE20K semantic segmentation with a linear decoder, DINOv2 reaches 49.0 mIoU, substantially above DINO (39.8) and MAE (46.2). On NYUd monocular depth estimation, DINOv2 frozen features achieve state-of-the-art results. On image retrieval benchmarks (Oxford and Paris), DINOv2 features outperform all previous self-supervised approaches by large margins.

The pattern across all evaluations is consistent: DINOv2 frozen features are competitive with or superior to fine-tuned features from other methods. This universality is the product of combining image-level (DINO) and patch-level (iBOT) objectives — the model simultaneously learns what an image contains and where each semantic element is located, producing representations that transfer to any task that requires understanding visual content at any spatial granularity.

Register Tokens

A follow-up paper by the same team identified and solved a subtle problem in DINOv2’s attention maps. When visualizing the self-attention patterns of large ViT models (ViT-L and ViT-g), certain background patches exhibit anomalously high attention values — attention “spikes” that concentrate on semantically uninformative regions of the image. These artifacts occur because the model repurposes low-information background patches as implicit storage for global information that doesn’t belong to any specific spatial location. The patches essentially become computational scratch space, but this corrupts their spatial representations and creates noisy attention maps that hurt dense prediction tasks.

The solution is elegant: add a small number of extra learnable tokens — called register tokens — to the input sequence alongside the [CLS] token and patch tokens. These [reg] tokens have no spatial meaning and no corresponding image content. During training, the model learns to use register tokens as explicit storage for the global information that was previously being dumped into background patches. With 4 register tokens, the attention artifacts disappear entirely: attention maps become smooth and semantically meaningful, concentrating on object boundaries and salient regions as expected.

The improvement is not merely cosmetic. Register tokens improve downstream performance on dense prediction tasks by 1–2 points because patch representations are no longer contaminated by global information leaking into spatial features. The register token approach is model-agnostic and can be applied retroactively to any Vision Transformer, making it a broadly useful architectural insight beyond DINOv2.
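Mechanically, registers are just extra sequence slots that are added before the encoder and dropped after it. The sketch below shows that bookkeeping (the embedding dimension, init scale, and default of 4 registers are illustrative; in a real ViT the register embeddings are trained parameters, not fixed arrays).

```python
import numpy as np

class Registers:
    """Minimal sketch: prepend learnable [reg] tokens after [CLS], run the
    encoder on the extended sequence, then discard the register slots so
    downstream heads never see them."""
    def __init__(self, n_reg=4, dim=384, seed=0):
        rng = np.random.default_rng(seed)
        self.reg = rng.standard_normal((n_reg, dim)) * 0.02  # trained in practice
        self.n_reg = n_reg

    def prepend(self, cls_tok, patch_tok):
        # cls_tok: (1, D), patch_tok: (N, D) -> (1 + n_reg + N, D)
        return np.concatenate([cls_tok, self.reg, patch_tok], axis=0)

    def strip(self, seq):
        # After the encoder: keep [CLS] and patch tokens, drop registers.
        return seq[:1], seq[1 + self.n_reg:]
```

Because the registers participate in attention but are stripped before any head, the model gains scratch space for global information without any patch position having to serve that role.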

How DINOv2 Compares

Self-Supervised Method Comparison

How DINOv2 compares to other self-supervised and vision-language approaches on frozen feature evaluation. Linear probe accuracy and frozen segmentation (mIoU) measure feature quality without fine-tuning.

Method | Training | Data | Linear Probe | Seg. (mIoU) | Notes
DINOv2 | DINO+iBOT | LVD-142M | 86.5% | 50.1 | Best frozen features across all tasks; curated data pipeline
DINO | Self-distillation | ImageNet-1K | 78.2% | 39.4 | Emergent segmentation via self-attention maps
MAE | Pixel reconstruction | ImageNet-1K | 75.8% | 46.2 | Efficient pre-training with 75% masking; good for dense tasks
CLIP | Contrastive (img+text) | WIT-400M | 82.6% | 35.1 | Zero-shot transfer via text prompts; strong classification
OpenCLIP | Contrastive (img+text) | LAION-2B | 84.4% | 37.8 | Scales well with data; open-source CLIP reproduction
iBOT | DINO+MIM | ImageNet-1K | 79.5% | 42.6 | Combines image-level and patch-level objectives

DINOv2's key insight
  • Combining DINO + iBOT objectives yields features that excel at both image-level and pixel-level tasks
  • Curated LVD-142M data provides diversity without the noise of uncurated web crawls
  • No text supervision needed: purely visual self-supervised learning matches or beats CLIP on dense tasks
Trade-offs
  • Requires massive compute for ViT-g training; not reproducible without significant resources
  • No zero-shot text-based transfer; unlike CLIP, it cannot classify with text prompts
  • Data curation pipeline adds complexity; seed dataset selection affects downstream quality

Key Results

Model | Linear Probe | Seg. (mIoU) | Depth (RMSE) | Notes
DINOv2 ViT-g | 86.5% | 49.0 | 0.279 | LVD-142M, frozen
DINOv2 ViT-L | 83.5% | 47.2 | 0.296 | Distilled from ViT-g
DINOv2 ViT-B | 82.1% | 44.5 | 0.325 | Distilled from ViT-g
DINOv2 ViT-S | 79.2% | 40.1 | 0.372 | Distilled from ViT-g
DINO ViT-L | 83.5% | 39.8 | n/a | ImageNet-1K only
MAE ViT-L | 75.8% | 46.2 | n/a | Requires fine-tuning for cls
OpenCLIP ViT-G | 85.2% | 41.3 | n/a | Text supervision required

Why DINOv2 Matters

DINOv2 represents a convergence point for self-supervised visual learning. It demonstrates that three ingredients — combining complementary training objectives (image-level + patch-level), curating training data at scale, and distilling knowledge into efficient models — together produce visual features that are genuinely universal. Prior methods forced practitioners to choose: DINO for classification, MAE for segmentation, CLIP for retrieval. DINOv2 eliminates that choice by producing a single set of frozen features that works across all tasks simultaneously, often matching or exceeding fine-tuned task-specific models.

The engineering implications are profound. DINOv2 features can serve as a general-purpose visual backbone for any computer vision pipeline: freeze the backbone, attach a task-specific head, and train only the head with minimal data and compute. This dramatically reduces the cost and complexity of building vision systems, particularly for practitioners without access to large labeled datasets or extensive GPU resources. The distilled model family (ViT-S through ViT-g) makes these features accessible at every compute budget. DINOv2 effectively commoditizes strong visual representations — they are freely available, require no labels to produce, and work out of the box for virtually any visual understanding task.

Key Takeaways

  1. Combining DINO and iBOT objectives is better than either alone — image-level self-distillation drives classification strength while patch-level masked prediction drives dense spatial understanding. The combined loss produces features that transfer universally without fine-tuning.

  2. Curated data contributes as much as algorithmic innovation — LVD-142M, built through retrieval-based curation from web crawls, provides 2–3% improvement over ImageNet-22K. Data quality, diversity, and deduplication are first-class research contributions, not afterthoughts.

  3. Distillation preserves almost all performance at a fraction of the cost — ViT-L retains 97% of the ViT-g teacher’s accuracy at one-third the parameters. This cascade makes state-of-the-art features practical across the full compute spectrum from mobile to datacenter.

  4. Register tokens solve attention artifacts elegantly — adding 4 learnable tokens without spatial meaning gives the model explicit global storage, eliminating the attention spikes that corrupt dense representations in large ViTs. The fix is simple, model-agnostic, and improves downstream performance.

  5. Frozen features can match fine-tuned models — DINOv2’s 86.5% linear probe and 49.0 mIoU segmentation with frozen features challenge the assumption that fine-tuning is necessary for strong performance. This shifts the paradigm from training task-specific backbones to training task-specific heads on universal features.

Related Methods

  • DINO — Self-distillation with Vision Transformers, the image-level objective that DINOv2 builds upon
  • MAE — Masked autoencoders for pixel reconstruction, complementary to DINOv2’s iBOT component
  • BEiT — Discrete visual token prediction that pioneered masked image modeling
  • I-JEPA — Joint-embedding prediction without reconstruction, an alternative to masked modeling
  • SimCLR — Contrastive learning framework that DINOv2’s DINO component evolved from
  • MoCo — Momentum contrast, the architectural precursor to DINO’s momentum teacher
  • BYOL — Non-contrastive learning without negatives, influencing DINO’s asymmetric design
  • VICReg — Variance-invariance-covariance regularization, an alternative non-contrastive approach

If you found this paper review helpful, consider sharing it with others.
