Paper Overview
DINO — self-DIstillation with NO labels — demonstrates that self-supervised Vision Transformers learn features containing explicit information about the semantic layout of images. When you visualize the self-attention maps from the final layer's [CLS] token, the heads naturally segment objects without any pixel-level supervision, bounding boxes, or labels of any kind.
Published at ICCV 2021 by Mathilde Caron, Hugo Touvron, Ishan Misra, and colleagues at Facebook AI Research (now Meta AI) and Inria, DINO combines knowledge distillation with self-supervised learning through a student-teacher framework where both networks share the same architecture. The teacher is not pretrained — it is built online as an exponential moving average of the student.
The results are striking: a ViT-S/16 trained with DINO achieves 77.0% top-1 accuracy on ImageNet under linear evaluation and 45.9 Jaccard index on PASCAL VOC object segmentation — nearly double the 27.3 achieved by a supervised ViT with the same architecture. These segmentation properties emerge without any segmentation training objective.
DINO Architecture
DINO is built on self-distillation: a student network learns by matching the output distribution of a teacher network, where the teacher is simply an exponential moving average (EMA) of the student's own weights. Both networks share the same architecture — there is no separate, pretrained teacher.
The framework has four key components that work together to produce high-quality representations:
- Multi-crop augmentation: The teacher only sees large global crops while the student processes both global and smaller local crops. This asymmetry forces the student to learn local-to-global correspondences.
- Shared architecture: Both student and teacher use the same backbone (ViT or ResNet). The teacher's weights are an EMA of the student, not independently trained.
- Softmax with temperature: Both networks produce probability distributions via softmax, with the teacher using a lower temperature to produce sharper predictions.
- Cross-entropy loss: The student learns by minimizing cross-entropy between its output distribution and the teacher's output distribution across different view pairs.
Multi-Crop Training
DINO's multi-crop strategy creates an asymmetry between what the teacher and student see. The teacher receives only two global crops (covering large portions of the image, typically 50% or more), while the student processes all crops — both the global views and several smaller local crops (covering around 5% of the image area).
This asymmetry is the critical design choice. By requiring the student to match the teacher's global-view output while only seeing a small local patch, DINO forces the student to infer global semantic content from local visual information. A local crop of a dog's ear must produce a representation consistent with the teacher's representation of the entire dog.
The paper uses 2 global crops at resolution 224x224 and several local crops (typically 6-10) at resolution 96x96. The teacher only processes the two global views, while the student processes all views. The loss is computed over all cross-view pairs where the student and teacher see different views.
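The crop geometry described above can be sketched as follows. This is a hypothetical helper, not the paper's code — the official implementation uses RandomResizedCrop with its own scale ranges, and the area fractions below are illustrative:

```python
import random

def sample_crop(img_w, img_h, min_area, max_area, out_size):
    """Sample one random square crop box covering a fraction of the image
    area in [min_area, max_area], to be resized to out_size x out_size."""
    area_frac = random.uniform(min_area, max_area)
    side = int((area_frac * img_w * img_h) ** 0.5)
    side = min(side, img_w, img_h)          # keep the box inside the image
    x = random.randint(0, img_w - side)
    y = random.randint(0, img_h - side)
    return (x, y, side, out_size)

def multi_crop(img_w, img_h, n_local=8):
    # 2 global crops (large area, resized to 224) — seen by teacher and student
    # n_local local crops (small area, resized to 96) — seen by the student only
    global_crops = [sample_crop(img_w, img_h, 0.5, 1.0, 224) for _ in range(2)]
    local_crops = [sample_crop(img_w, img_h, 0.05, 0.25, 96) for _ in range(n_local)]
    return global_crops, local_crops
```

The asymmetry lives entirely in which list each network consumes: the teacher receives only `global_crops`, the student receives both lists.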
Loss Function
DINO minimizes the cross-entropy between the teacher's output probability distribution and the student's output distribution, computed across all valid pairs of views. Crucially, a view is never compared with itself — the loss only considers pairs where the teacher and student process different crops.
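Written out in the paper's notation, the objective the student minimizes is:

$$
\min_{\theta_s} \sum_{x \in \{x_1^g,\, x_2^g\}} \;\; \sum_{\substack{x' \in V \\ x' \neq x}} H\!\left(P_t(x),\, P_s(x')\right)
$$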
Here $x_1^g$ and $x_2^g$ are the two global views processed by the teacher, $V$ is the full set of views (global and local), and $H(a, b) = -\sum_i a^{(i)} \log b^{(i)}$ is the standard cross-entropy. The teacher output appears as the target distribution and the student output as the predicted distribution.
Both networks produce K-dimensional probability distributions via softmax with temperature scaling. The teacher probability for dimension i is computed with centering to prevent collapse:
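In the paper's notation, with $g_{\theta_t}(x)$ the teacher network's output and $c$ the centering vector:

$$
P_t(x)^{(i)} = \frac{\exp\!\big((g_{\theta_t}(x)^{(i)} - c^{(i)})/\tau_t\big)}{\sum_{k=1}^{K} \exp\!\big((g_{\theta_t}(x)^{(k)} - c^{(k)})/\tau_t\big)}
$$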
The teacher temperature τt is set lower than the student temperature τs (τs = 0.1, while τt is warmed up from 0.04 to 0.07 over the first epochs), producing sharper probability distributions from the teacher. The centering vector c is subtracted before the softmax to prevent any single dimension from dominating.
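A minimal NumPy sketch of the two temperature-scaled softmaxes and the cross-entropy between them (function names here are illustrative; the paper's own pseudocode is written in PyTorch):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def teacher_probs(logits, center, tau_t=0.04):
    # centering (subtract c) then sharpening (low temperature)
    return softmax((logits - center) / tau_t)

def student_probs(logits, tau_s=0.1):
    return softmax(logits / tau_s)

def dino_cross_entropy(p_teacher, p_student, eps=1e-12):
    # H(P_t, P_s) = -sum_i P_t(i) log P_s(i), averaged over the batch
    return -(p_teacher * np.log(p_student + eps)).sum(axis=-1).mean()

rng = np.random.default_rng(0)
K = 8
center = np.zeros(K)
t_logits = rng.normal(size=(4, K))   # teacher outputs for global views
s_logits = rng.normal(size=(4, K))   # student outputs for other views

p_t = teacher_probs(t_logits, center)
p_s = student_probs(s_logits)
loss = dino_cross_entropy(p_t, p_s)
```

Gradients flow only through the student branch; the teacher's outputs are treated as fixed targets (a stop-gradient in the real implementation).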
Preventing Collapse: Centering and Sharpening
Self-distillation without labels faces a fundamental collapse risk: the teacher and student could converge to outputting the same uniform or constant distribution for all inputs. DINO prevents this through two complementary mechanisms that work in opposing directions.
Centering subtracts a running mean from the teacher's output before applying softmax. The center vector c is updated with exponential moving average over batch statistics:
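The update, with $g_{\theta_t}(x_i)$ the teacher output for the $i$-th sample in the batch:

$$
c \leftarrow m\, c + (1 - m)\, \frac{1}{B} \sum_{i=1}^{B} g_{\theta_t}(x_i)
$$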
where m is the momentum rate (typically 0.9) and B is the batch size. Centering prevents any single dimension from dominating the softmax output, which would cause one form of collapse where the teacher always assigns high probability to the same class regardless of input.
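A sketch of that update, assuming the teacher outputs for the batch are collected into a (batch, K) array (names are illustrative):

```python
import numpy as np

def update_center(center, teacher_logits, m=0.9):
    """EMA update of the centering vector over the batch mean
    of the teacher's outputs."""
    batch_mean = teacher_logits.mean(axis=0)
    return m * center + (1 - m) * batch_mean
```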
Sharpening uses a low teacher temperature τt to produce peaked probability distributions. While centering pushes toward uniformity by preventing dominance of any single dimension, sharpening pushes away from uniformity by amplifying the largest logits. These two forces balance each other: centering alone would cause uniform collapse, and sharpening alone would cause dominance collapse.
The paper's ablation confirms that both are essential — removing either one causes training to collapse.
EMA Teacher Update
The teacher network in DINO is not trained with gradients. Instead, its parameters are updated as an exponential moving average (EMA) of the student parameters after each training step:
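With $\theta_s$ the student parameters and $\theta_t$ the teacher parameters, the update rule is:

$$
\theta_t \leftarrow \lambda\, \theta_t + (1 - \lambda)\, \theta_s
$$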
The momentum coefficient λ follows a cosine schedule from 0.996 to 1.0 over the course of training. Early in training, λ = 0.996 means the teacher incorporates more of the student's rapid updates, allowing it to keep pace with the student's fast initial learning. As training progresses, λ approaches 1.0, making the teacher increasingly stable — it changes very slowly, providing a consistent target for the student.
This schedule is motivated by the observation that early training requires faster teacher adaptation to track meaningful changes in the student, while late training benefits from a nearly frozen teacher that provides a stable, high-quality reference. The cosine schedule provides a smooth interpolation between these two regimes.
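A sketch of such a cosine schedule; the exact parameterization below is an assumption consistent with the endpoints described above:

```python
import math

def teacher_momentum(step, total_steps, base=0.996, final=1.0):
    """Cosine schedule for the EMA momentum, rising from `base` at
    step 0 to `final` at the last step."""
    cos = 0.5 * (1 + math.cos(math.pi * step / total_steps))
    return final - (final - base) * cos
```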
The EMA approach is what makes DINO a self-distillation method: the teacher is not an external model but a temporally smoothed version of the student itself. This means the student is effectively learning from its own past, ensembled over recent training history.
Emergent Attention Maps
The most striking finding of the DINO paper is that self-supervised ViTs learn attention maps that explicitly segment objects in images. When visualizing the self-attention of the [CLS] token in the last layer, different attention heads attend to different semantic regions of the image — one head might focus on the foreground object, another on the boundary, and another on the background.
This is DINO's signature discovery. These segmentation-quality attention maps emerge purely from the self-supervised training objective. The model was never shown any segmentation masks, bounding boxes, or pixel-level labels. Yet the attention heads learn to decompose images into semantically meaningful regions.
Quantitatively, DINO ViT-S/16 achieves 45.9 Jaccard index (IoU) on PASCAL VOC object segmentation using only the attention maps — nearly doubling the 27.3 achieved by a supervised ViT-S/16 with the same architecture. On DAVIS 2017 video object segmentation, DINO features enable competitive tracking performance without any video-specific training.
This property is specific to the combination of self-supervised training and the ViT architecture. Supervised ViTs do not produce such clean attention maps, and self-supervised CNNs lack the patch-level attention mechanism entirely.
How DINO Compares
SSL Method Comparison
How DINO compares to other self-supervised and supervised learning methods on ImageNet.
| Method | Collapse Prevention | k-NN Top-1 | Linear Top-1 | Seg. IoU | Key Mechanism |
|---|---|---|---|---|---|
| DINO | Centering + sharpening | 74.5 | 77.0 | 45.9 | Self-distillation with multi-crop |
| SimCLR | Large negative sets | — | 69.3 | — | NT-Xent contrastive loss |
| BYOL | Momentum + predictor | 64.8 | 74.4 | — | Asymmetric architecture with EMA |
| SwAV | Online clustering | 65.7 | 75.3 | — | Swapped online clustering assignments |
| MoCo v3 | Momentum contrastive | — | 72.5 | — | Momentum-updated encoder (v3 drops the memory queue) |
| Supervised | Labels prevent it | 79.8 | 79.8 | 27.3 | Cross-entropy on labeled data |
DINO's unique advantages
- Segmentation IoU of 45.9 vs 27.3 supervised — emergent object boundaries without any pixel labels
- Strong k-NN performance (74.5%) shows features are well-clustered without fine-tuning
- No negative pairs or contrastive loss required — self-distillation alone prevents collapse
Trade-offs
- Momentum encoder (EMA teacher) adds memory overhead and complexity
- Multi-crop augmentation strategy increases GPU cost during training
- Strongest results with Vision Transformers — gains are smaller with convolutional backbones
Key Results
ImageNet Classification
Under linear evaluation (frozen backbone, trained linear classifier), DINO achieves strong results across architectures and patch sizes:
| Model | k-NN Accuracy | Linear Eval |
|---|---|---|
| DINO ViT-S/16 | 74.5% | 77.0% |
| DINO ViT-B/16 | 76.1% | 78.2% |
| DINO ViT-S/8 | 78.3% | 79.7% |
| BYOL ResNet-50 | 64.8% | 74.4% |
| SwAV ResNet-50 | 65.7% | 75.3% |
The k-NN accuracy is particularly noteworthy. DINO's representations are so well-structured that a simple k-nearest-neighbor classifier (no training at all) reaches within 2-3% of the linear probe. This gap is much smaller than for other methods, indicating that DINO features form tighter, more linearly separable clusters in the embedding space.
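The k-NN protocol can be sketched with a simple cosine-similarity classifier over frozen features. This is a simplified stand-in: the paper uses a weighted k-NN with k = 20 over ImageNet features:

```python
import numpy as np

def knn_predict(train_feats, train_labels, test_feats, k=5):
    """Majority-vote k-NN under cosine similarity on frozen features."""
    a = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    b = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    sims = b @ a.T                         # (n_test, n_train) similarities
    nn = np.argsort(-sims, axis=1)[:, :k]  # indices of the k nearest neighbors
    votes = train_labels[nn]               # (n_test, k) neighbor labels
    return np.array([np.bincount(v).argmax() for v in votes])
```

No parameters are trained at all — accuracy under this protocol directly measures how well-clustered the frozen embedding space is.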
Patch Size Matters More Than Model Size
One of DINO's counterintuitive findings is that reducing patch size (increasing the number of tokens) improves performance more than increasing model capacity. ViT-S/8 (small model, 8x8 patches) achieves 79.7% linear eval — outperforming ViT-B/16 (base model, 16x16 patches) at 78.2%. More spatial tokens provide finer-grained attention and better feature resolution, which matters more than adding parameters.
Critical Ablations
The paper provides thorough ablation studies that isolate the contribution of each component:
- No momentum teacher (copying student weights directly): performance drops significantly, confirming that EMA smoothing is essential for stable training
- No multi-crop: roughly 2-3% accuracy drop, showing that the local-to-global training signal is important but not the only factor
- Adding a predictor (like BYOL's MLP predictor): no improvement, suggesting that centering and sharpening already serve the asymmetry role
- Centering removed: training collapses — uniform output distribution
- Sharpening removed (high teacher temperature): training collapses — single-dimension dominance
Why DINO Matters
DINO's significance extends beyond its classification and segmentation numbers. The emergent attention properties reveal something fundamental about what self-supervised ViTs learn: without any explicit spatial supervision, the model discovers that attending to coherent object regions is the most efficient strategy for self-distillation. The attention maps are not a designed feature — they are a consequence of the training objective and architecture.
This finding has influenced the design of subsequent models. DINOv2 scaled the approach to larger datasets and model sizes, producing a foundation model with strong performance across dense prediction tasks. The idea that self-supervised ViTs implicitly learn spatial decomposition has informed work on open-vocabulary segmentation, where DINO features serve as a spatial backbone.
The connection between knowledge distillation and self-supervised learning is also conceptually important. DINO shows that distillation does not require a pretrained teacher — the teacher can emerge from the student's own training trajectory. This collapses the distinction between distillation and self-supervised learning, suggesting they are endpoints on a spectrum rather than separate paradigms.
Key Takeaways
- Self-distillation works without labels — a student can learn strong representations by matching the output distribution of its own EMA teacher, without any pretrained model or labeled data.
- Attention maps emerge as object segmenters — DINO ViTs learn to segment objects through self-attention alone, achieving 45.9 IoU on PASCAL VOC without any segmentation supervision.
- Centering and sharpening prevent collapse — these two opposing forces balance each other: centering prevents single-dimension dominance while sharpening prevents uniform collapse.
- Spatial resolution trumps model size — reducing patch size from 16 to 8 improves performance more than scaling from ViT-S to ViT-B, because finer-grained tokens enable richer attention patterns.
- The EMA schedule matters — cosine annealing the momentum from 0.996 to 1.0 provides fast early adaptation and stable late-training targets, both essential for convergence.
Related Reading
- VICReg — Variance-invariance-covariance regularization for self-supervised learning without collapse
- V-JEPA — Joint-embedding predictive architecture for video representation learning
- Vision Transformer — The ViT architecture that DINO trains in a self-supervised manner
- CLIP — Contrastive vision-language pretraining that learns from paired image-text supervision rather than self-distillation
- Attention Is All You Need — The transformer architecture underlying Vision Transformers
