2024
V-JEPA: Learning Video Representations by Predicting in Latent Space
How V-JEPA learns video representations by predicting masked spatiotemporal regions in embedding space rather than reconstructing pixels, achieving state-of-the-art results with a frozen backbone and better label efficiency than pixel-reconstruction methods.
