V-JEPA: Learning Video Representations by Predicting in Latent Space

How V-JEPA learns powerful video representations by predicting masked spatiotemporal regions in embedding space rather than reconstructing pixels, achieving state-of-the-art frozen features with superior label efficiency.

Adrien Bardes, Quentin Garrido, et al. · 15 min read · Original Paper · Tags: self-supervised-learning, video-understanding, representation-learning

Paper Overview

Self-supervised learning from video is uniquely promising because video contains temporal structure that images lack — objects move, occlude, transform, and interact over time. A model that can predict what happens next, or fill in what it cannot see, must develop a deep understanding of the physical world. Yet most video SSL methods have followed the same strategy as image reconstruction: mask out patches and predict their pixels. V-JEPA asks a fundamental question — what if predicting pixels is the wrong objective entirely?

V-JEPA (Video Joint-Embedding Predictive Architecture) learns video representations by predicting masked spatiotemporal regions in an abstract embedding space rather than reconstructing raw pixel values. The key insight is that pixel-level prediction forces the model to allocate capacity to irrelevant low-level details — exact textures, lighting variations, compression artifacts — while latent prediction allows the model to focus on semantic content. This is not a minor architectural tweak; it represents a philosophical shift in what we ask self-supervised models to learn.

The results validate this shift decisively. Using a ViT-L/16 backbone evaluated with frozen features (no fine-tuning), V-JEPA achieves 82.1% top-1 accuracy on Kinetics-400 and 71.2% on Something-Something v2 — surpassing all prior pixel-reconstruction methods by large margins. On SSv2, a benchmark that requires genuine temporal reasoning rather than appearance shortcuts, the gap is particularly striking: V-JEPA outperforms VideoMAEv2 by over 14 percentage points with frozen features.

Published in TMLR 2024 by Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas at Meta AI / NYU, V-JEPA extends the I-JEPA framework from images to video, demonstrating that the joint-embedding predictive architecture scales naturally to spatiotemporal data. The work establishes a new paradigm for video understanding: learn by predicting abstract features, not by reconstructing pixels.

The Core Idea: Predict Features, Not Pixels

Pixel-reconstruction methods like VideoMAE and VideoMAEv2 train a decoder to reconstruct the exact RGB values of masked patches. This objective treats all pixel-level variation as equally important. But consider what a model must represent to reconstruct a video frame: the precise shade of a person's shirt, the exact pattern of grass in the background, the specific noise introduced by the camera sensor. None of these details are relevant for understanding what is happening in the video. The model wastes capacity modeling a high-entropy signal full of perceptually irrelevant information.

V-JEPA sidesteps this problem entirely by operating in a learned embedding space. Instead of reconstructing pixels, the predictor takes the embeddings of visible patches and predicts the embeddings that a separate target encoder would produce for the masked patches. Because the target encoder has already discarded low-level noise in favor of semantic features, the prediction target is inherently more meaningful. The model learns to predict what matters — object identity, motion patterns, spatial relationships — without being penalized for failing to reproduce irrelevant surface details.

Spatiotemporal Masking Strategy

The masking strategy is one of V-JEPA's most carefully designed components. Unlike image masking where spatial blocks suffice, video masking must account for temporal coherence. A mask that covers random patches across frames provides a trivially easy task — the model can interpolate from nearby unmasked patches in adjacent frames. The masking must be challenging enough to force the model to learn genuine spatiotemporal reasoning.

V-JEPA uses a multi-block masking strategy that generates short, wide spatiotemporal tubes. Each mask block spans 8 consecutive frames and covers a large spatial region (aspect ratio between 0.75 and 1.5), with the total masking ratio between 0.85 and 0.95. Critically, multiple such blocks are sampled and unioned together per training sample. This creates large contiguous regions of missing information that cannot be filled by simple interpolation — the model must reason about object motion, scene dynamics, and temporal causality. Ablations show that this multi-block strategy substantially outperforms both random masking and single-tube masking, because it creates prediction tasks that require higher-level understanding.
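A sketch of this multi-block sampler, under stated assumptions: the 8-frame temporal extent, the 0.75–1.5 aspect-ratio range, and the high overall masking ratio come from the description above, while the token-grid size, the number of blocks, and each block's spatial coverage fraction are illustrative placeholders.

```python
import numpy as np

def sample_multiblock_mask(T=16, H=14, W=14, num_blocks=8,
                           frames_per_block=8, rng=None):
    """Union of short, wide spatiotemporal blocks (illustrative sketch).

    T, H, W: token-grid extents along time/height/width. Each block spans
    `frames_per_block` consecutive frames and a large spatial region with
    aspect ratio drawn from [0.75, 1.5]; the per-block coverage fraction
    (0.3-0.5 of the spatial grid) is an assumption, not a paper value.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    mask = np.zeros((T, H, W), dtype=bool)
    for _ in range(num_blocks):
        t0 = rng.integers(0, T - frames_per_block + 1)
        aspect = rng.uniform(0.75, 1.5)           # block height/width ratio
        area = rng.uniform(0.3, 0.5) * H * W      # assumed spatial coverage
        h = min(int(round(np.sqrt(area * aspect))), H)
        w = min(int(round(np.sqrt(area / aspect))), W)
        y0 = rng.integers(0, H - h + 1)
        x0 = rng.integers(0, W - w + 1)
        mask[t0:t0 + frames_per_block, y0:y0 + h, x0:x0 + w] = True
    return mask

mask = sample_multiblock_mask()
print(f"masking ratio: {mask.mean():.2f}")
```

Because the blocks are sampled independently and unioned, the final masking ratio emerges from their overlap rather than being set directly, which is why the paper reports a range (0.85–0.95) rather than a single value.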

V-JEPA Architecture

The architecture follows the joint-embedding predictive framework: an encoder processes the visible context, a predictor maps context embeddings to predictions in the target space, and a momentum-updated target encoder provides the prediction targets. The entire system is trained end-to-end without negative pairs, contrastive losses, or pixel-level reconstruction.
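The end-to-end flow can be sketched with stand-in linear maps for each component. This is a toy illustration, not the paper's implementation: the linear "encoders", the dimensions, and the pooled prediction are all placeholder assumptions; only the data flow (visible tokens to context encoder, full video to target encoder, L1 loss in latent space) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, dim = 20, 8
tokens = rng.normal(size=(num_tokens, dim))   # toy patch embeddings
mask = np.zeros(num_tokens, dtype=bool)
mask[2:] = True                               # 90% of tokens masked

# Stand-in linear maps for the three components (placeholders).
W_ctx = rng.normal(size=(dim, dim)) * 0.1     # "context encoder"
W_tgt = W_ctx.copy()                          # "target encoder": EMA copy, no gradients
W_pred = rng.normal(size=(dim, dim)) * 0.1    # "predictor"

ctx = tokens[~mask] @ W_ctx                   # encode visible tokens only
targets = (tokens @ W_tgt)[mask]              # target encoder sees the full video
pooled = np.tile(ctx.mean(axis=0), (mask.sum(), 1))
preds = pooled @ W_pred                       # predict masked-token embeddings
loss = np.abs(preds - targets).sum(axis=-1).mean()  # L1 in latent space
```

In the real system the encoder and predictor are transformers, and only the context encoder and predictor receive gradients; the target branch is updated by EMA.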

Context Encoder

The context encoder is a standard Vision Transformer (ViT-L/16 or ViT-H/16) that processes only the visible (unmasked) patches from the input video. The video is first divided into non-overlapping spatiotemporal patches of size 2 × 16 × 16 (2 frames by 16x16 pixels). The visible patches are embedded, combined with positional encodings, and processed through the full transformer stack. Because only visible patches are processed — typically 5-15% of the total — the encoder never sees the masked regions and must encode enough information in its outputs to support the predictor's task.
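The tokenization arithmetic is easy to check. For an illustrative 16-frame, 224 × 224 clip (the clip size is an assumption; the 2 × 16 × 16 tubelet size is from the paper):

```python
# Token budget for a 16-frame, 224x224 clip with 2x16x16 tubelets.
# Clip dimensions are illustrative; the patch size follows the paper.
frames, height, width = 16, 224, 224
pt, ph, pw = 2, 16, 16
num_tokens = (frames // pt) * (height // ph) * (width // pw)  # 8 * 14 * 14
visible = int(num_tokens * 0.10)  # ~10% visible at a 90% masking ratio
print(num_tokens, visible)        # 1568 156
```

So at a 90% masking ratio the context encoder attends over only ~156 tokens instead of 1568, which is where most of the training-cost savings come from.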

Predictor

The predictor is a narrow transformer — 12 layers with an embedding width of 384, significantly smaller than the main encoder. It takes the context encoder's output embeddings for visible patches along with positional mask tokens for the target locations, and produces predicted embeddings for each masked position. The predictor's narrow architecture is intentional: a predictor with too much capacity could memorize the mapping rather than relying on the encoder to produce informative representations. This bottleneck forces the encoder to carry the representational burden.
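Mechanically, the predictor's input sequence is the visible-token embeddings followed by one shared, learned mask token per target location, each tagged with that location's positional encoding. A minimal sketch (the function name and array shapes are hypothetical):

```python
import numpy as np

def build_predictor_input(ctx_emb, pos_enc, masked_idx, mask_token):
    """Assemble the predictor's input sequence (hypothetical sketch).

    ctx_emb: (num_visible, dim) context-encoder outputs.
    pos_enc: (num_tokens, dim) positional encodings for every position.
    masked_idx: indices of the masked target positions.
    mask_token: (dim,) single learned embedding shared by all targets.
    """
    # One query per target: the shared mask token plus that position's
    # positional encoding, so the predictor knows *where* to predict.
    queries = mask_token[None, :] + pos_enc[masked_idx]
    return np.concatenate([ctx_emb, queries], axis=0)
```

The predictor's outputs at the query positions are then compared against the target encoder's embeddings for those same locations.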

Target Encoder (EMA)

The target encoder shares the same architecture as the context encoder but is not trained by gradient descent. Instead, its parameters are updated as an exponential moving average (EMA) of the context encoder's parameters:

$$\bar{\theta} \leftarrow \tau \cdot \bar{\theta} + (1 - \tau) \cdot \theta$$

where $\bar{\theta}$ are the target encoder parameters, $\theta$ are the context encoder parameters, and $\tau$ is the momentum coefficient, which follows a cosine schedule from 0.996 to 1.0 over pre-training. The EMA update provides a slowly evolving prediction target that stabilizes training. Without it, the target would shift too rapidly and the predictor could exploit trivial solutions. The target encoder processes the full video (all patches, including those masked from the context encoder) and produces the ground-truth embeddings that the predictor must match.
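The update and its momentum schedule can be written in a few lines (a sketch: the schedule shape is the standard cosine ramp implied by the endpoints above, and the helper names are ours):

```python
import numpy as np

def ema_update(target_params, online_params, tau):
    """In-place EMA: theta_bar <- tau * theta_bar + (1 - tau) * theta."""
    for p_bar, p in zip(target_params, online_params):
        p_bar *= tau
        p_bar += (1.0 - tau) * p

def cosine_momentum(step, total_steps, tau_start=0.996, tau_end=1.0):
    """Cosine ramp of the momentum coefficient from tau_start to tau_end."""
    progress = step / total_steps
    return tau_end - (tau_end - tau_start) * 0.5 * (1 + np.cos(np.pi * progress))
```

Early in training `cosine_momentum` stays near 0.996, letting the target track the fast-moving encoder; near the end it approaches 1.0, freezing the target for stability.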

Loss Function

V-JEPA uses a simple L1 loss between the predicted and target embeddings, averaged over all masked positions:

$$\mathcal{L} = \frac{1}{|M|} \sum_{i \in M} \left\lVert \hat{z}_i - \bar{z}_i \right\rVert_1$$

where $M$ is the set of masked positions, $\hat{z}_i$ is the predictor's output for position $i$, and $\bar{z}_i$ is the target encoder's embedding for the same position. The L1 loss is chosen over L2 because it is less sensitive to outliers and does not disproportionately penalize large errors on individual dimensions. Importantly, this loss operates entirely in the latent space — no decoder, no pixel reconstruction, no auxiliary losses. The simplicity of the objective is a feature, not a limitation.
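The objective is a one-liner; a sketch matching the formula above (the array shapes are assumptions):

```python
import numpy as np

def vjepa_loss(pred, target, masked_idx):
    """Mean over masked positions of the L1 distance between predicted
    and target embeddings. pred/target: (num_tokens, dim); masked_idx:
    indices of the masked token positions."""
    per_token = np.abs(pred[masked_idx] - target[masked_idx]).sum(axis=-1)
    return per_token.mean()
```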

Evaluating Frozen Features

A central claim of V-JEPA is that its pre-trained features are powerful enough to be used without fine-tuning. Most video SSL methods report results after end-to-end fine-tuning on the downstream task, which conflates the quality of the pre-trained representation with the model's ability to adapt during supervised training. Frozen evaluation isolates the representation quality: the encoder's weights are fixed, and only a lightweight probe is trained on top.

V-JEPA introduces attentive probing as its primary evaluation protocol. Unlike linear probing, which fits a single linear layer on the average-pooled representation, attentive probing uses a small cross-attention module that can attend to different spatial and temporal positions in the frozen feature map. This is important because video understanding is inherently position-dependent — the features at the location of a moving hand are more relevant for action recognition than the features of a static background. Attentive probing gives the frozen features a fair chance to demonstrate their quality without modifying them.
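A single-query version of such a probe can be sketched in a few lines. The real protocol uses a full multi-head cross-attention module; this toy collapses it to one learned query vector, and all weight names are placeholders.

```python
import numpy as np

def attentive_probe(features, q, W_k, W_v, W_cls):
    """Single-query cross-attention pooling over frozen features,
    followed by a linear classifier (illustrative sketch).

    features: (num_tokens, d) frozen encoder outputs.
    q: (d,) learned query; W_k, W_v: (d, d); W_cls: (d, num_classes).
    """
    K = features @ W_k                     # (N, d) keys
    V = features @ W_v                     # (N, d) values
    scores = K @ q / np.sqrt(len(q))       # (N,) scaled attention scores
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()                     # softmax over token positions
    pooled = attn @ V                      # (d,) attention-weighted pool
    return pooled @ W_cls                  # (num_classes,) logits
```

Unlike average pooling, the learned query can concentrate weight on the few tokens (a moving hand, an interacting object) that actually discriminate between actions.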

Label Efficiency

One of V-JEPA's most practically significant results is its label efficiency — the ability to maintain strong performance even when only a fraction of labeled data is available. In real-world settings, labeled video data is expensive to obtain because annotation requires watching and understanding temporal content, not just categorizing a static image.

When trained with only 10% of Kinetics-400 labels, V-JEPA's performance drops by approximately 12% relative to using the full labeled set. Under the same conditions, VideoMAEv2 drops by approximately 30%. This 2.5x advantage in label efficiency suggests that V-JEPA's latent representations capture more transferable structure from the unlabeled pre-training phase. The model has already learned to understand video content; the labeled data merely needs to map this understanding to specific categories.

How V-JEPA Compares

V-JEPA sits within a growing family of video self-supervised learning methods. The comparison is most meaningful when restricted to frozen-feature evaluation, which isolates representation quality from fine-tuning dynamics.

Video SSL Method Comparison

How V-JEPA compares to other video self-supervised learning methods on frozen encoder evaluation.

| Method | Prediction | Negatives | Masking | K400 | SSv2 | Summary |
|---|---|---|---|---|---|---|
| V-JEPA | Latent features | Not required | Multi-block 90% | 82.1% | 71.2% | Feature prediction in embedding space with EMA target encoder |
| VideoMAEv2 | Raw pixels | Not required | Dual masking 90% | 77.0% | 57.0% | Pixel reconstruction with dual masking at scale |
| VideoMAE | Raw pixels | Not required | Tube masking 90% | 72.4% | 53.7% | Masked pixel reconstruction with tube masking |
| I-JEPA | Latent features | Not required | Multi-block | — | — | Feature prediction for images (V-JEPA predecessor) |
| MAE | Raw pixels | Not required | Random 75% | — | — | Masked pixel reconstruction for images |
| DINO | CLS token | Not required | Multi-crop | — | — | Self-distillation with EMA teacher network |

V-JEPA advantage
  • Predicts in latent space — discards unpredictable pixel noise
  • +5.1 points on K400 and +14.2 points on SSv2 over the best pixel-reconstruction frozen features (VideoMAEv2)
  • 1.5-6× more training efficient than pixel reconstruction methods
Key distinction
  • Pixel methods (VideoMAE) must reconstruct exact textures and noise
  • Latent methods (V-JEPA) only predict abstract semantic features
  • This enables focus on 'what matters' rather than surface details

Key Results Summary

| Benchmark | Metric | V-JEPA (ViT-L) | Best Pixel Method | Improvement |
|---|---|---|---|---|
| K400 | Frozen top-1 | 82.1% | 77.0% (VideoMAEv2) | +5.1% |
| SSv2 | Frozen top-1 | 71.2% | 57.0% (VideoMAEv2) | +14.2% |
| ImageNet | Frozen top-1 | 77.4% | — | Competitive |
| K400 (10% labels) | Relative drop | -12% | -30% (VideoMAEv2) | 2.5× better |

The SSv2 result deserves particular attention. Something-Something v2 is a benchmark where actions cannot be recognized from a single frame — distinguishing "pushing something left" from "pushing something right" requires temporal reasoning. The 14.2-point gap between V-JEPA and the best pixel method on this benchmark demonstrates that latent prediction learns fundamentally better temporal representations.

Training Efficiency

V-JEPA is also more efficient to train than pixel-reconstruction methods. Because the encoder only processes visible patches (5-15% of the total), and there is no pixel-level decoder, the training cost per iteration is substantially lower. V-JEPA uses 64 A100 GPUs for pre-training, compared to the larger computational budgets required by VideoMAEv2 with its full decoder. The efficiency gain compounds with the quality advantage: V-JEPA learns better representations faster.

Critical Ablations

The paper includes thorough ablation studies that isolate the contribution of each design choice. These experiments reveal which components are essential and which are merely helpful.

Prediction target. Comparing latent prediction to pixel prediction using the same architecture and masking strategy, latent prediction outperforms by 5+ points on downstream tasks. This confirms that the prediction target, not the architecture, drives the performance gap. The model's capacity is better spent predicting what the target encoder deems important rather than reconstructing raw pixels.

Masking strategy. Multi-block masking outperforms random masking by approximately 4 points and tube masking by approximately 2 points on K400 frozen evaluation. The multi-block strategy creates harder prediction tasks that require understanding spatial and temporal context simultaneously. Random masking is too easy because nearby unmasked patches provide trivial interpolation signals.

Masking ratio. Performance peaks with 85-95% of patches masked. Lower masking ratios provide too much context, making the prediction task trivial. Higher ratios leave too little context for meaningful prediction. The optimal range is surprisingly high — the model learns best when it must predict the vast majority of the video from a tiny visible fraction.

Predictor capacity. A predictor with 12 layers and width 384 outperforms both smaller and larger predictors. Smaller predictors cannot capture the complexity of the prediction mapping. Larger predictors can shortcut the learning process — they become powerful enough to produce good predictions without forcing the encoder to learn informative representations. The bottleneck is essential.

EMA momentum. The cosine schedule from 0.996 to 1.0 outperforms both fixed momentum and lower initial values. The schedule matters because early in training the encoder changes rapidly and the target should track it somewhat closely, while later in training stability becomes more important and the target should evolve slowly.

Why V-JEPA Matters

  1. Information filtering by design. By predicting in latent space rather than pixel space, V-JEPA builds information filtering into its objective function. The target encoder learns what is worth predicting, and the context encoder learns to support those predictions. This eliminates the capacity waste that plagues pixel-reconstruction methods — no model parameters are spent modeling camera noise, compression artifacts, or irrelevant textures.

  2. No handcrafted augmentations. Unlike contrastive methods that rely heavily on carefully designed augmentation pipelines (random cropping, color jitter, Gaussian blur), V-JEPA uses only masking as its self-supervised signal. This makes the method less sensitive to augmentation hyperparameters and more likely to generalize across domains where standard augmentations may not be appropriate — medical imaging, satellite video, industrial inspection.

  3. Modality-agnostic architecture. The joint-embedding predictive framework does not assume anything about the input modality. The same architecture that processes video patches could process audio spectrograms, point clouds, or any other data that can be tokenized. V-JEPA demonstrates this flexibility by extending I-JEPA from images to video with minimal architectural changes, suggesting a path toward unified self-supervised learning across modalities.

  4. Frozen features as first-class outputs. Most SSL methods treat frozen evaluation as an afterthought — a diagnostic metric subordinate to fine-tuning performance. V-JEPA makes frozen features the primary deliverable. The 82.1% K400 accuracy with frozen features exceeds many methods' fine-tuned results with smaller backbones. This is practically significant: frozen features enable efficient deployment where the encoder runs once and a tiny probe handles multiple downstream tasks simultaneously.

Key Takeaways

  1. Predicting in latent space is strictly better than predicting pixels for learning video representations. The performance gap is not marginal — it is 5+ points on standard benchmarks and 14+ points on temporally demanding ones.

  2. Masking strategy matters as much as architecture. Multi-block spatiotemporal masking with 85-95% masking ratio creates prediction tasks hard enough to force genuine video understanding, while random or insufficient masking leads to trivial solutions.

  3. The predictor bottleneck is essential. A narrow predictor (12 layers, width 384) forces the encoder to produce maximally informative representations. Larger predictors can shortcut the learning objective and degrade encoder quality.

  4. Frozen features can match or exceed fine-tuned baselines. V-JEPA's frozen ViT-L achieves 82.1% on K400, demonstrating that high-quality pre-training eliminates the need for task-specific adaptation in many settings.

  5. Label efficiency follows from representation quality. V-JEPA's 12% relative drop with 10% labels (vs. 30% for VideoMAEv2) shows that better pre-trained representations require less supervised data to be useful, directly reducing the annotation burden for practitioners.

Related Reading

  • VICReg — Variance-invariance-covariance regularization for self-supervised learning, another Meta AI / LeCun contribution to the SSL landscape
  • Vision Transformer — The ViT architecture that serves as V-JEPA's backbone encoder
  • Attention Is All You Need — The transformer architecture underlying both the encoder and predictor
  • CLIP — Contrastive vision-language learning, a different approach to self-supervised visual representation
  • EfficientNet — Efficient backbone design, relevant context for understanding the compute-performance tradeoffs V-JEPA navigates

If you found this paper review helpful, consider sharing it with others.
