masking Papers | Abhik Sarkar

2024

V-JEPA: Learning Video Representations by Predicting in Latent Space

self-supervised-learning video-understanding representation-learning vision-transformers masking

How V-JEPA learns powerful video representations by predicting masked spatiotemporal regions in embedding space rather than reconstructing pixels, achieving state-of-the-art frozen features with superior label efficiency.

