BEiT: BERT Pre-Training of Image Transformers
How BEiT bridges BERT and vision by predicting discrete visual tokens from masked image patches — the first masked image modeling approach for Vision Transformers, achieving 83.2% on ImageNet-1K.
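The masked-image-modeling objective described above can be sketched in a few lines. This is a minimal stand-in, not BEiT's implementation: the tokenizer ids and model logits are random placeholders (the real BEiT uses DALL-E's discrete VAE codebook of 8192 visual tokens and a ViT backbone), and the point is only the shape of the loss — cross-entropy over codebook ids, computed solely at masked patch positions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a 14x14 grid of patches, codebook of 8192 visual tokens.
num_patches, vocab_size = 196, 8192

# 1. A pretrained discrete tokenizer maps each patch to a codebook id
#    (stand-in: random ids; BEiT uses DALL-E's dVAE tokenizer).
visual_tokens = rng.integers(0, vocab_size, size=num_patches)

# 2. Mask roughly 40% of patch positions (BEiT uses blockwise masking).
mask = rng.random(num_patches) < 0.4

# 3. The Transformer outputs a distribution over the codebook at every
#    position (stand-in: random logits in place of the real model).
logits = rng.normal(size=(num_patches, vocab_size))

# 4. Masked-image-modeling loss: cross-entropy on masked positions only,
#    predicting the discrete visual token of each corrupted patch.
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -log_probs[mask, visual_tokens[mask]].mean()
print(f"MIM loss over {mask.sum()} masked patches: {loss:.2f}")
```

With random logits the loss sits near log(8192) ≈ 9.0; training drives it down by making the predicted distribution concentrate on the correct token id at each masked position.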
How DINOv2 combines DINO self-distillation with iBOT masked prediction at scale on curated data (LVD-142M), producing the strongest open-source frozen visual features across classification, segmentation, depth, and retrieval.
How I-JEPA learns visual representations by predicting abstract feature representations of masked image regions — no pixel reconstruction, no augmentation — achieving 81.7% linear probe accuracy with ViT-H.
How V-JEPA 2 scales self-supervised video learning to 1M+ hours with mask denoising and 3D-RoPE, then extends to V-JEPA 2-AC — an action-conditioned world model that enables zero-shot robotic planning from just 62 hours of unlabeled video.
How self-distillation with no labels produces Vision Transformer attention maps that automatically segment objects — without any pixel-level supervision.
How masking 75% of image patches and reconstructing pixels creates a scalable self-supervised learner that trains ViT-H to 87.8% on ImageNet-1K — 3.5× faster than full encoding, no labels required.