BEiT: BERT Pre-Training of Image Transformers
How BEiT bridges BERT and vision by predicting discrete visual tokens from masked image patches — the first masked image modeling approach for Vision Transformers, achieving 83.2% on ImageNet-1K.
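The masked-token objective described above can be sketched in a few lines. This is an illustrative numpy stand-in, not BEiT's implementation: the patch count, codebook size, and roughly-40% mask ratio follow the paper's base configuration, while the logits are random placeholders for the Vision Transformer head and the tokens stand in for the pretrained dVAE tokenizer's output.

```python
import numpy as np

rng = np.random.default_rng(0)

num_patches = 196   # 14x14 grid: 224x224 image, 16x16 patches
vocab_size = 8192   # discrete visual-token codebook size (dVAE)
mask_ratio = 0.4    # roughly 40% of patches are masked

# Ground-truth visual tokens per patch (stand-in for the tokenizer).
tokens = rng.integers(0, vocab_size, size=num_patches)

# Randomly select patches to mask (BEiT uses blockwise masking;
# uniform sampling here keeps the sketch short).
mask = np.zeros(num_patches, dtype=bool)
mask[rng.choice(num_patches, int(mask_ratio * num_patches), replace=False)] = True

# Stand-in for the Transformer's per-patch logits over the codebook.
logits = rng.standard_normal((num_patches, vocab_size))

# Masked image modeling loss: cross-entropy at masked positions only.
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -log_probs[mask, tokens[mask]].mean()
```

Only the masked positions contribute to the loss, so the model must infer each missing patch's visual token from its visible context rather than copy it through.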
How DINOv2 combines DINO self-distillation with iBOT masked prediction at scale on curated data (LVD-142M), producing the strongest open-source frozen visual features across classification, segmentation, depth, and retrieval.
How I-JEPA learns visual representations by predicting abstract feature representations of masked image regions — no pixel reconstruction, no augmentation — achieving 81.7% linear probe accuracy with ViT-H.
How V-JEPA 2 scales self-supervised video learning to 1M+ hours with mask denoising and 3D-RoPE, then extends to V-JEPA 2-AC — an action-conditioned world model that enables zero-shot robotic planning from just 62 hours of unlabeled video.
How self-supervised learning works without negative pairs — a predictor and momentum target network are all you need to prevent representation collapse.
How self-distillation with no labels produces Vision Transformer attention maps that automatically segment objects — without any pixel-level supervision.
How masking 75% of image patches and reconstructing pixels creates a scalable self-supervised learner that trains ViT-H to 87.8% on ImageNet-1K — 3.5× faster than full encoding, no labels required.
How a momentum-updated encoder and a dictionary queue make contrastive learning practical — large dictionaries with consistent keys, no large-batch requirement.
How a simple framework — augmentation, shared encoder, projection head, and contrastive loss — set a new standard for self-supervised visual representation learning.
How V-JEPA learns powerful video representations by predicting masked spatiotemporal regions in embedding space rather than reconstructing pixels, achieving state-of-the-art frozen features with superior label efficiency.
How variance, invariance, and covariance regularization enables self-supervised representation learning without negative pairs or momentum encoders.
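The variance/invariance/covariance idea in the last entry is compact enough to sketch directly. This is a minimal numpy illustration of the three terms, not the paper's PyTorch implementation; the coefficient values and `eps` are assumptions matching commonly cited defaults.

```python
import numpy as np

def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """VICReg-style loss on two batches of embeddings (batch, dim)."""
    n, d = z_a.shape

    # Invariance: the two views of the same image should embed alike.
    inv = ((z_a - z_b) ** 2).mean()

    # Variance: hinge loss keeping each dimension's std above 1,
    # which prevents all embeddings collapsing to a point.
    std_a = np.sqrt(z_a.var(axis=0) + eps)
    std_b = np.sqrt(z_b.var(axis=0) + eps)
    var = np.maximum(0.0, 1.0 - std_a).mean() + np.maximum(0.0, 1.0 - std_b).mean()

    # Covariance: push off-diagonal covariance entries toward zero
    # so dimensions carry decorrelated information.
    def off_diag_cov(z):
        zc = z - z.mean(axis=0)
        c = (zc.T @ zc) / (n - 1)
        off = c - np.diag(np.diag(c))
        return (off ** 2).sum() / d

    cov = off_diag_cov(z_a) + off_diag_cov(z_b)
    return sim_w * inv + var_w * var + cov_w * cov

rng = np.random.default_rng(0)
z = rng.standard_normal((256, 32))
# Identical views: the invariance term vanishes, leaving only the
# variance and covariance regularizers.
total = vicreg_loss(z, z.copy())
```

No negative pairs and no momentum encoder appear anywhere in the loss; collapse is ruled out purely by the variance and covariance regularizers.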