Paper Overview
DINO — self-DIstillation with NO labels — demonstrates that self-supervised Vision Transformers learn features containing explicit information about the semantic layout of images. When you visualize the self-attention maps from the final layer's [CLS] token, the heads naturally segment objects without any pixel-level supervision, bounding boxes, or labels of any kind.
Published at ICCV 2021 by Mathilde Caron, Hugo Touvron, Ishan Misra, and colleagues at Facebook AI Research (now Meta AI) and Inria, DINO combines knowledge distillation with self-supervised learning through a student-teacher framework where both networks share the same architecture. The teacher is not pretrained — it is built online as an exponential moving average of the student.
The results are striking: a ViT-S/16 trained with DINO achieves 77.0% top-1 accuracy on ImageNet under linear evaluation and 45.9 Jaccard index on PASCAL VOC object segmentation — nearly double the 27.3 achieved by a supervised ViT with the same architecture. These segmentation properties emerge without any segmentation training objective.
DINO Architecture
DINO is built on self-distillation: a student network learns by matching the output distribution of a teacher network, where the teacher is simply an exponential moving average (EMA) of the student's own weights. Both networks share the same architecture — there is no separate, pretrained teacher.
The framework has four key components that work together to produce high-quality representations:
- Multi-crop augmentation: The teacher only sees large global crops while the student processes both global and smaller local crops. This asymmetry forces the student to learn local-to-global correspondences.
- Shared architecture: Both student and teacher use the same backbone (ViT or ResNet). The teacher's weights are an EMA of the student, not independently trained.
- Softmax with temperature: Both networks produce probability distributions via softmax, with the teacher using a lower temperature to produce sharper predictions.
- Cross-entropy loss: The student learns by minimizing cross-entropy between its output distribution and the teacher's output distribution across different view pairs.
Multi-Crop Training
DINO's multi-crop strategy creates an asymmetry between what the teacher and student see. The teacher receives only two global crops (covering large portions of the image, typically 50% or more), while the student processes all crops — both the global views and several smaller local crops (covering around 5% of the image area).
This asymmetry is the critical design choice. By requiring the student to match the teacher's global-view output while only seeing a small local patch, DINO forces the student to infer global semantic content from local visual information. A local crop of a dog's ear must produce a representation consistent with the teacher's representation of the entire dog.
The paper uses 2 global crops at resolution 224x224 and several local crops (typically 6-10) at resolution 96x96. The teacher only processes the two global views, while the student processes all views. The loss is computed over all cross-view pairs where the student and teacher see different views.
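The crop geometry described above can be sketched as follows. This is a hypothetical helper, not the paper's code — the official implementation uses RandomResizedCrop with its own scale ranges, and the area fractions below are illustrative:

```python
import random

def sample_crop(img_w, img_h, min_area, max_area, out_size):
    """Sample one random square crop box covering a fraction of the image
    area in [min_area, max_area], to be resized to out_size x out_size."""
    area_frac = random.uniform(min_area, max_area)
    side = int((area_frac * img_w * img_h) ** 0.5)
    side = min(side, img_w, img_h)          # keep the box inside the image
    x = random.randint(0, img_w - side)
    y = random.randint(0, img_h - side)
    return (x, y, side, out_size)

def multi_crop(img_w, img_h, n_local=8):
    # 2 global crops (large area, resized to 224) — seen by teacher and student
    # n_local local crops (small area, resized to 96) — seen by the student only
    global_crops = [sample_crop(img_w, img_h, 0.5, 1.0, 224) for _ in range(2)]
    local_crops = [sample_crop(img_w, img_h, 0.05, 0.25, 96) for _ in range(n_local)]
    return global_crops, local_crops
```

The asymmetry lives entirely in which list each network consumes: the teacher receives only `global_crops`, the student receives both lists.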
Loss Function
DINO minimizes the cross-entropy between the teacher's output probability distribution and the student's output distribution, computed across all valid pairs of views. Crucially, a view is never compared with itself — the loss only considers pairs where the teacher and student process different crops.
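Written out in the paper's notation, the objective the student minimizes is:

$$
\min_{\theta_s} \sum_{x \in \{x_1^g,\, x_2^g\}} \;\; \sum_{\substack{x' \in V \\ x' \neq x}} H\!\left(P_t(x),\, P_s(x')\right)
$$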
Here $x_1^g$ and $x_2^g$ are the two global views processed by the teacher, $V$ is the full set of views (global and local), and $H(a, b) = -\sum_i a^{(i)} \log b^{(i)}$ is the standard cross-entropy. The teacher output appears as the target distribution and the student output as the predicted distribution.
Both networks produce K-dimensional probability distributions via softmax with temperature scaling. The teacher probability for dimension i is computed with centering to prevent collapse:
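In the paper's notation, with $g_{\theta_t}(x)$ the teacher network's output and $c$ the centering vector:

$$
P_t(x)^{(i)} = \frac{\exp\!\big((g_{\theta_t}(x)^{(i)} - c^{(i)})/\tau_t\big)}{\sum_{k=1}^{K} \exp\!\big((g_{\theta_t}(x)^{(k)} - c^{(k)})/\tau_t\big)}
$$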
The teacher temperature τt is set lower than the student temperature τs (τs = 0.1, while τt is warmed up from 0.04 to 0.07 over the first epochs), producing sharper probability distributions from the teacher. The centering vector c is subtracted before the softmax to prevent any single dimension from dominating.
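A minimal NumPy sketch of the two temperature-scaled softmaxes and the cross-entropy between them (function names here are illustrative; the paper's own pseudocode is written in PyTorch):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def teacher_probs(logits, center, tau_t=0.04):
    # centering (subtract c) then sharpening (low temperature)
    return softmax((logits - center) / tau_t)

def student_probs(logits, tau_s=0.1):
    return softmax(logits / tau_s)

def dino_cross_entropy(p_teacher, p_student, eps=1e-12):
    # H(P_t, P_s) = -sum_i P_t(i) log P_s(i), averaged over the batch
    return -(p_teacher * np.log(p_student + eps)).sum(axis=-1).mean()

rng = np.random.default_rng(0)
K = 8
center = np.zeros(K)
t_logits = rng.normal(size=(4, K))   # teacher outputs for global views
s_logits = rng.normal(size=(4, K))   # student outputs for other views

p_t = teacher_probs(t_logits, center)
p_s = student_probs(s_logits)
loss = dino_cross_entropy(p_t, p_s)
```

Gradients flow only through the student branch; the teacher's outputs are treated as fixed targets (a stop-gradient in the real implementation).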
Preventing Collapse: Centering and Sharpening
Self-distillation without labels faces a fundamental collapse risk: the teacher and student could converge to outputting the same uniform or constant distribution for all inputs. DINO prevents this through two complementary mechanisms that work in opposing directions.
Centering subtracts a running mean from the teacher's output before applying softmax. The center vector c is updated with exponential moving average over batch statistics:
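The update, with $g_{\theta_t}(x_i)$ the teacher output for the $i$-th sample in the batch:

$$
c \leftarrow m\, c + (1 - m)\, \frac{1}{B} \sum_{i=1}^{B} g_{\theta_t}(x_i)
$$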
where m is the momentum rate (typically 0.9) and B is the batch size. Centering prevents any single dimension from dominating the softmax output, which would cause one form of collapse where the teacher always assigns high probability to the same class regardless of input.
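A sketch of that update, assuming the teacher outputs for the batch are collected into a (batch, K) array (names are illustrative):

```python
import numpy as np

def update_center(center, teacher_logits, m=0.9):
    """EMA update of the centering vector over the batch mean
    of the teacher's outputs."""
    batch_mean = teacher_logits.mean(axis=0)
    return m * center + (1 - m) * batch_mean
```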
Sharpening uses a low teacher temperature τt to produce peaked probability distributions. While centering pushes toward uniformity by preventing dominance of any single dimension, sharpening pushes away from uniformity by amplifying the largest logits. These two forces balance each other: centering alone would cause uniform collapse, and sharpening alone would cause dominance collapse.
The paper's ablation confirms that both are essential — removing either one causes training to collapse.
EMA Teacher Update
The teacher network in DINO is not trained with gradients. Instead, its parameters are updated as an exponential moving average (EMA) of the student parameters after each training step:
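With $\theta_s$ the student parameters and $\theta_t$ the teacher parameters, the update rule is:

$$
\theta_t \leftarrow \lambda\, \theta_t + (1 - \lambda)\, \theta_s
$$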
The momentum coefficient λ follows a cosine schedule from 0.996 to 1.0 over the course of training. Early in training, λ = 0.996 means the teacher incorporates more of the student's rapid updates, allowing it to keep pace with the student's fast initial learning. As training progresses, λ approaches 1.0, making the teacher increasingly stable — it changes very slowly, providing a consistent target for the student.
This schedule is motivated by the observation that early training requires faster teacher adaptation to track meaningful changes in the student, while late training benefits from a nearly frozen teacher that provides a stable, high-quality reference. The cosine schedule provides a smooth interpolation between these two regimes.
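A sketch of such a cosine schedule; the exact parameterization below is an assumption consistent with the endpoints described above:

```python
import math

def teacher_momentum(step, total_steps, base=0.996, final=1.0):
    """Cosine schedule for the EMA momentum, rising from `base` at
    step 0 to `final` at the last step."""
    cos = 0.5 * (1 + math.cos(math.pi * step / total_steps))
    return final - (final - base) * cos
```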
The EMA approach is what makes DINO a self-distillation method: the teacher is not an external model but a temporally smoothed version of the student itself. This means the student is effectively learning from its own past, ensembled over recent training history.
Emergent Attention Maps
The most striking finding of the DINO paper is that self-supervised ViTs learn attention maps that explicitly segment objects in images. When visualizing the self-attention of the [CLS] token in the last layer, different attention heads attend to different semantic regions of the image — one head might focus on the foreground object, another on the boundary, and another on the background.
This is DINO's signature discovery. These segmentation-quality attention maps emerge purely from the self-supervised training objective. The model was never shown any segmentation masks, bounding boxes, or pixel-level labels. Yet the attention heads learn to decompose images into semantically meaningful regions.
Quantitatively, DINO ViT-S/16 achieves 45.9 Jaccard index (IoU) on PASCAL VOC object segmentation using only the attention maps — nearly doubling the 27.3 achieved by a supervised ViT-S/16 with the same architecture. On DAVIS 2017 video object segmentation, DINO features enable competitive tracking performance without any video-specific training.
This property is specific to the combination of self-supervised training and the ViT architecture. Supervised ViTs do not produce such clean attention maps, and self-supervised CNNs lack the patch-level attention mechanism entirely.
How DINO Compares
SSL Method Comparison
How DINO compares to other self-supervised and supervised learning methods on ImageNet.
| Method | Collapse Prevention | k-NN Top-1 | Linear Top-1 | Seg. IoU | Key Mechanism |
|---|---|---|---|---|---|
| DINO | Centering + sharpening | 74.5 | 77.0 | 45.9 | Self-distillation with multi-crop |
| SimCLR | Large negative sets | — | 69.3 | — | NT-Xent contrastive loss |
| BYOL | Momentum + predictor | 64.8 | 74.4 | — | Asymmetric architecture with EMA |
| SwAV | Online clustering | 65.7 | 75.3 | — | Swapped online clustering assignments |
| MoCo v3 | Momentum contrastive | — | 72.5 | — | Momentum-updated encoder (v3 drops the memory queue) |
| Supervised | Labels prevent it | 79.8 | 79.8 | 27.3 | Cross-entropy on labeled data |
DINO's unique advantages
- Segmentation IoU of 45.9 vs 27.3 supervised — emergent object boundaries without any pixel labels
- Strong k-NN performance (74.5%) shows features are well-clustered without fine-tuning
- No negative pairs or contrastive loss required — self-distillation alone prevents collapse
Trade-offs
- Momentum encoder (EMA teacher) adds memory overhead and complexity
- Multi-crop augmentation strategy increases GPU cost during training
- Strongest results with Vision Transformers — gains are smaller with convolutional backbones
Key Results
ImageNet Classification
Under linear evaluation (frozen backbone, trained linear classifier), DINO achieves strong results across architectures and patch sizes:
| Model | k-NN Accuracy | Linear Eval |
|---|---|---|
| DINO ViT-S/16 | 74.5% | 77.0% |
| DINO ViT-B/16 | 76.1% | 78.2% |
| DINO ViT-S/8 | 78.3% | 79.7% |
| BYOL ResNet-50 | 64.8% | 74.4% |
| SwAV ResNet-50 | 65.7% | 75.3% |
The k-NN accuracy is particularly noteworthy. DINO's representations are so well-structured that a simple k-nearest-neighbor classifier (no training at all) reaches within 2-3% of the linear probe. This gap is much smaller than for other methods, indicating that DINO features form tighter, more linearly separable clusters in the embedding space.
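The k-NN protocol can be sketched with a simple cosine-similarity classifier over frozen features. This is a simplified stand-in: the paper uses a weighted k-NN with k = 20 over ImageNet features:

```python
import numpy as np

def knn_predict(train_feats, train_labels, test_feats, k=5):
    """Majority-vote k-NN under cosine similarity on frozen features."""
    a = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    b = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    sims = b @ a.T                         # (n_test, n_train) similarities
    nn = np.argsort(-sims, axis=1)[:, :k]  # indices of the k nearest neighbors
    votes = train_labels[nn]               # (n_test, k) neighbor labels
    return np.array([np.bincount(v).argmax() for v in votes])
```

No parameters are trained at all — accuracy under this protocol directly measures how well-clustered the frozen embedding space is.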
Patch Size Matters More Than Model Size
One of DINO's counterintuitive findings is that reducing patch size (increasing the number of tokens) improves performance more than increasing model capacity. ViT-S/8 (small model, 8x8 patches) achieves 79.7% linear eval — outperforming ViT-B/16 (base model, 16x16 patches) at 78.2%. More spatial tokens provide finer-grained attention and better feature resolution, which matters more than adding parameters.
Critical Ablations
The paper provides thorough ablation studies that isolate the contribution of each component:
- No momentum teacher (copying student weights directly): performance drops significantly, confirming that EMA smoothing is essential for stable training
- No multi-crop: roughly 2-3% accuracy drop, showing that the local-to-global training signal is important but not the only factor
- Adding a predictor (like BYOL's MLP predictor): no improvement, suggesting that centering and sharpening already serve the asymmetry role
- Centering removed: training collapses — uniform output distribution
- Sharpening removed (high teacher temperature): training collapses — single-dimension dominance
Why DINO Matters
DINO's significance extends beyond its classification and segmentation numbers. The emergent attention properties reveal something fundamental about what self-supervised ViTs learn: without any explicit spatial supervision, the model discovers that attending to coherent object regions is the most efficient strategy for self-distillation. The attention maps are not a designed feature — they are a consequence of the training objective and architecture.
This finding has influenced the design of subsequent models. DINOv2 scaled the approach to larger datasets and model sizes, producing a foundation model with strong performance across dense prediction tasks. The idea that self-supervised ViTs implicitly learn spatial decomposition has informed work on open-vocabulary segmentation, where DINO features serve as a spatial backbone.
The connection between knowledge distillation and self-supervised learning is also conceptually important. DINO shows that distillation does not require a pretrained teacher — the teacher can emerge from the student's own training trajectory. This collapses the distinction between distillation and self-supervised learning, suggesting they are endpoints on a spectrum rather than separate paradigms.
Key Takeaways
- Self-distillation works without labels — a student can learn strong representations by matching the output distribution of its own EMA teacher, without any pretrained model or labeled data.
- Attention maps emerge as object segmenters — DINO ViTs learn to segment objects through self-attention alone, achieving 45.9 IoU on PASCAL VOC without any segmentation supervision.
- Centering and sharpening prevent collapse — these two opposing forces balance each other: centering prevents single-dimension dominance while sharpening prevents uniform collapse.
- Spatial resolution trumps model size — reducing patch size from 16 to 8 improves performance more than scaling from ViT-S to ViT-B, because finer-grained tokens enable richer attention patterns.
- The EMA schedule matters — cosine annealing the momentum from 0.996 to 1.0 provides fast early adaptation and stable late-training targets, both essential for convergence.
Related Reading
- VICReg — Variance-invariance-covariance regularization for self-supervised learning without collapse
- V-JEPA — Joint-embedding predictive architecture for video representation learning
- Vision Transformer — The ViT architecture that DINO trains in a self-supervised manner
- CLIP — Contrastive vision-language pretraining that learns from paired image-text supervision rather than self-distillation
- Attention Is All You Need — The transformer architecture underlying Vision Transformers
