BEiT: BERT Pre-Training of Image Transformers
How BEiT bridges BERT and vision by predicting discrete visual tokens from masked image patches — the first masked image modeling approach for Vision Transformers, achieving 83.2% on ImageNet-1K.
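The masked-image-modeling objective described above can be sketched in a few lines. This is a minimal stand-in, not BEiT's implementation: the tokenizer ids and model logits are random placeholders (the real BEiT uses DALL-E's discrete VAE codebook of 8192 visual tokens and a ViT backbone), and the point is only the shape of the loss — cross-entropy over codebook ids, computed solely at masked patch positions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a 14x14 grid of patches, codebook of 8192 visual tokens.
num_patches, vocab_size = 196, 8192

# 1. A pretrained discrete tokenizer maps each patch to a codebook id
#    (stand-in: random ids; BEiT uses DALL-E's dVAE tokenizer).
visual_tokens = rng.integers(0, vocab_size, size=num_patches)

# 2. Mask roughly 40% of patch positions (BEiT uses blockwise masking).
mask = rng.random(num_patches) < 0.4

# 3. The Transformer outputs a distribution over the codebook at every
#    position (stand-in: random logits in place of the real model).
logits = rng.normal(size=(num_patches, vocab_size))

# 4. Masked-image-modeling loss: cross-entropy on masked positions only,
#    predicting the discrete visual token of each corrupted patch.
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -log_probs[mask, visual_tokens[mask]].mean()
print(f"MIM loss over {mask.sum()} masked patches: {loss:.2f}")
```

With random logits the loss sits near log(8192) ≈ 9.0; training drives it down by making the predicted distribution concentrate on the correct token id at each masked position.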
How DINOv2 combines DINO self-distillation with iBOT masked prediction at scale on curated data (LVD-142M), producing the strongest open-source frozen visual features across classification, segmentation, depth, and retrieval.
How I-JEPA learns visual representations by predicting abstract feature representations of masked image regions — no pixel reconstruction, no augmentation — achieving 81.7% linear probe accuracy with ViT-H.
How V-JEPA 2 scales self-supervised video learning to 1M+ hours with mask denoising and 3D-RoPE, then extends to V-JEPA 2-AC — an action-conditioned world model that enables zero-shot robotic planning from just 62 hours of unlabeled video.
How self-distillation with no labels produces Vision Transformer attention maps that automatically segment objects — without any pixel-level supervision.
How masking 75% of image patches and reconstructing pixels creates a scalable self-supervised learner that trains ViT-H to 87.8% on ImageNet-1K — 3.5× faster than full encoding, no labels required.