Paper Overview
V-JEPA showed that predicting masked video regions in latent space — rather than reconstructing pixels — produces superior video representations. V-JEPA 2 asks the natural follow-up: what happens when you scale this idea to its limits, and what new capabilities emerge?
The answer is striking. V-JEPA 2 scales the joint-embedding predictive architecture to a ViT-g encoder (over 1 billion parameters), trains on VideoMix22M (a curated dataset spanning more than 1 million hours of internet video), and introduces two key technical improvements: mask denoising with L1 loss replaces mask prediction, and 3D Rotary Position Embeddings replace fixed sinusoidal encodings. Together, these changes push V-JEPA 2 to 77.3% top-1 on Something-Something v2 — a benchmark that requires genuine temporal reasoning — and 39.7 recall@5 on Epic-Kitchens-100 action anticipation, a 44% relative improvement over all previous methods.
But V-JEPA 2’s most remarkable contribution goes beyond classification. The authors extend V-JEPA 2 into V-JEPA 2-AC, an action-conditioned world model that can plan robot actions in latent space. Trained on just 62 hours of unlabeled robot video from the Droid dataset, V-JEPA 2-AC is deployed zero-shot on physical Franka robotic arms in two different labs — achieving 65–80% pick-and-place success by planning in 16 seconds what pixel-generation approaches like Cosmos take 4 minutes to compute. This demonstrates a path from self-supervised video understanding to embodied intelligence, all without task-specific labels, reward signals, or data from the target robots.
Why Video Understanding Needs Temporal Reasoning
Most image recognition benchmarks can be solved by analyzing appearance alone — a single frame of a dog is enough to classify “dog.” Video understanding is fundamentally harder because many actions can only be distinguished through temporal reasoning. Consider “pushing something left” versus “pushing something right” — a single frame shows a hand near an object, but the direction of motion is invisible without observing change across time.
This distinction separates video models from image models on benchmarks like Something-Something v2 (SSv2), where every action requires temporal analysis. Image-only models that recognize objects and scenes score below 50% on SSv2. V-JEPA 2’s 77.3% demonstrates that its self-supervised training on 1M+ hours of video teaches genuine temporal understanding — learning how objects move, interact, and transform over time rather than just recognizing what they look like in a single frame.
The V-JEPA 2 Pipeline
V-JEPA 2 follows the joint-embedding predictive framework established by I-JEPA and V-JEPA: an encoder processes visible context, a predictor maps this context to predictions for masked regions, and a momentum-updated target encoder provides the ground-truth embeddings. The training objective operates entirely in latent space — no pixel reconstruction, no decoder, no auxiliary losses.
The input video is divided into spatiotemporal tubelets of size 2 × 16 × 16 (2 frames × 16×16 pixels). Multi-block masking removes 85–95% of these tubelets, creating large contiguous gaps that force the model to reason about motion and semantics rather than interpolating from nearby visible patches. The encoder (ViT-g, 1B+ parameters) processes the visible tubelets using 3D Rotary Position Embeddings, and the predictor (ViT-S) maps the encoded context plus positional mask tokens to predicted embeddings. The target encoder — an exponential moving average of the main encoder with stop-gradient — processes the full video to produce target embeddings. The loss is L1 between predicted and target representations:
$$\mathcal{L}(\theta, \phi) = \big\| P_\phi\big(E_\theta(x)\big) - \mathrm{sg}\big(\bar{E}_\theta(y)\big) \big\|_1$$

where $E_\theta$ is the encoder, $P_\phi$ is the predictor, $\bar{E}_\theta$ is the EMA target encoder, $x$ represents visible patches, $y$ represents masked patches, and $\mathrm{sg}$ denotes stop-gradient.
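The loss can be sketched numerically. In this toy numpy version a linear map stands in for each encoder and a mean-pooled context stands in for the predictor; all names and shapes here are illustrative, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # embedding dimension for this toy example

W_online = rng.normal(size=(D, D))  # weights of the online encoder E_theta
W_target = W_online.copy()          # EMA target encoder starts as a copy

def encode(patches, W):
    # Stand-in for a ViT encoder: one linear map per patch embedding.
    return patches @ W

def predict(context, n_masked):
    # Stand-in for the predictor P_phi: broadcast the pooled context to each
    # masked position (the real predictor is a small ViT with mask tokens).
    return np.tile(context.mean(axis=0), (n_masked, 1))

x = rng.normal(size=(5, D))  # visible tubelet inputs
y = rng.normal(size=(3, D))  # masked tubelet inputs

pred = predict(encode(x, W_online), len(y))
target = encode(y, W_target)         # stop-gradient: W_target is never
loss = np.abs(pred - target).mean()  # updated by the loss, only by EMA

# EMA update of the target encoder after each optimizer step.
momentum = 0.999
W_target = momentum * W_target + (1 - momentum) * W_online
```

The stop-gradient is implicit here: the loss only ever updates `W_online`, while `W_target` trails it through the momentum update.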
From Mask Prediction to Mask Denoising
V-JEPA removes masked patches from the encoder entirely — the encoder never sees them. This creates a hard boundary in the representation: the model has complete information about visible regions and zero information about masked regions. At the boundary between visible and masked patches, prediction quality drops sharply because the encoder’s representations contain no signal from the masked side.
V-JEPA 2 replaces this with mask denoising. Instead of removing masked patches, noise is added to their representations before they enter the encoder. The encoder now sees all patches — clean versions of visible ones and corrupted versions of masked ones. This eliminates the hard boundary: the encoder has partial (noisy) information everywhere, producing smoother representations that transition gradually between well-understood and poorly-understood regions. The predictor’s task changes from “fill in what’s missing” to “clean up what’s corrupted” — a denoising objective that provides richer gradients and more stable training, especially at scale.
The switch from L2 to L1 loss complements this change. L2 loss penalizes large errors quadratically, which can destabilize training when the encoder produces outlier representations. L1 loss treats all errors linearly, providing more robust gradients that improve training stability for the billion-parameter ViT-g model.
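Both changes fit in a few lines of numpy (an illustrative sketch, not the paper's implementation): masked tokens are corrupted rather than dropped, and the L1 gradient stays bounded where the L2 gradient grows with the error:

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.normal(size=(10, 4))  # toy patch embeddings for one clip
mask = rng.random(10) < 0.9        # multi-block masking hides ~90%

# V-JEPA (mask prediction): masked tokens are dropped from the encoder
# input entirely, creating a hard visible/masked boundary.
visible_only = tokens[~mask]

# V-JEPA 2 (mask denoising): masked tokens are kept but corrupted with
# noise, so the encoder sees partial signal everywhere.
noisy = tokens.copy()
noisy[mask] += rng.normal(scale=0.5, size=(int(mask.sum()), 4))

# L1 vs L2 on a single outlier residual: the L2 gradient scales with the
# error magnitude, while the L1 gradient is always +/-1, so one outlier
# representation cannot blow up an update step.
residual = 10.0
grad_l2 = 2.0 * residual     # d(e^2)/de
grad_l1 = np.sign(residual)  # d|e|/de
```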
Progressive Resolution Training
Training a ViT-g on 64-frame video clips at 384×384 resolution from the start would require approximately 60 GPU-years on A100s. V-JEPA 2 avoids this through a progressive training strategy that starts at low resolution and scales up, achieving an 8.4× reduction in compute.
The training schedule has three stages. First, a warmup of 12,000 iterations at 16 frames and 256×256 resolution. Second, a constant phase of 228,000 iterations at the same low resolution — the bulk of training. Third, a cooldown of 12,000 iterations where both spatial resolution (256→384) and temporal length (16→64 frames) increase while the learning rate linearly decays. The model learns coarse spatiotemporal patterns during the long low-resolution phase and sharpens them during the brief high-resolution cooldown.
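The three-stage schedule can be written as a simple function of the iteration count. This is a hypothetical outline of the recipe above (function name and learning-rate scale are my own), not the paper's training code:

```python
def schedule(step, warmup=12_000, constant=228_000, cooldown=12_000,
             base_lr=1e-3):
    """Return (frames, resolution, learning rate) for a training step."""
    assert 0 <= step < warmup + constant + cooldown
    if step < warmup + constant:
        # Stages 1-2: short low-resolution clips for the bulk of training.
        frames, res = 16, 256
        lr = base_lr * min(1.0, step / warmup)  # linear warmup, then flat
    else:
        # Stage 3 (cooldown): longer, higher-resolution clips while the
        # learning rate decays linearly to zero.
        t = (step - warmup - constant) / cooldown  # 0 -> 1 across cooldown
        frames, res = 64, 384
        lr = base_lr * (1.0 - t)
    return frames, res, lr
```

Only 12,000 of the 252,000 iterations (under 5%) run at the expensive 64-frame, 384×384 setting, which is where the compute saving comes from.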
This works because 3D Rotary Position Embeddings (3D-RoPE) encode relative positions rather than absolute ones. Unlike fixed sinusoidal embeddings that must be interpolated when resolution changes — introducing artifacts — RoPE naturally generalizes to new resolutions by partitioning feature dimensions into three segments for temporal, height, and width axes, applying 1D rotations separately to each. This makes the resolution increase during cooldown seamless rather than disruptive.
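A minimal numpy sketch of the axis-partitioned rotary scheme (my simplification: real implementations interleave dimension pairs and fold the rotation into attention, but the partitioning idea is the same):

```python
import numpy as np

def rope_1d(x, pos, base=10_000.0):
    # Standard 1D rotary embedding over an even-sized feature segment:
    # each (x1, x2) pair is rotated by an angle proportional to pos.
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    cos, sin = np.cos(pos * freqs), np.sin(pos * freqs)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def rope_3d(x, t, h, w):
    # Partition features into three segments, rotating one per axis
    # (time, height, width); relative position falls out per axis with
    # no learned table to interpolate when resolution changes.
    d = x.shape[-1] // 3
    return np.concatenate(
        [rope_1d(x[..., :d], t),
         rope_1d(x[..., d:2 * d], h),
         rope_1d(x[..., 2 * d:], w)], axis=-1)

q = np.ones(12)
# Attention scores depend only on relative offsets: shifting both
# positions by the same amount leaves the dot product unchanged.
s1 = rope_3d(q, 1, 2, 3) @ rope_3d(q, 4, 5, 6)
s2 = rope_3d(q, 11, 12, 13) @ rope_3d(q, 14, 15, 16)
```

Because each segment is a pure rotation, token norms are preserved and the invariance to absolute position holds at any resolution, which is what makes the 256→384 cooldown seamless.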
From Understanding to Planning: V-JEPA 2-AC
V-JEPA 2-AC extends the frozen V-JEPA 2 encoder into an action-conditioned world model — a system that can predict what will happen in a scene given a sequence of robot actions. The key insight is that planning in latent space is both faster and more effective than generating pixel-level video of the future.
The architecture is a ~300M parameter transformer with 24 layers, block-causal attention, and 7D action vectors (3D position, 3D orientation, gripper state). Given the current scene encoded by the frozen V-JEPA 2 encoder and a proposed action, the predictor outputs the predicted next state — also in latent space. By chaining these predictions, the model can “imagine” multi-step futures without generating a single pixel.
Planning uses the Cross-Entropy Method (CEM) to search for action sequences whose predicted future states are closest (by L1 distance) to an encoded goal image. This search runs entirely in the compact latent space, completing in about 16 seconds. By contrast, video-generation approaches like Cosmos must render full-resolution pixel predictions for each candidate plan, taking over 4 minutes per query — a 15× slowdown.
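CEM itself is a short loop. Here is a toy latent-space version with a trivial additive dynamics model standing in for the V-JEPA 2-AC predictor; the 2D latent, the dynamics, and all names are illustrative assumptions, not the paper's planner:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_next(z, a):
    # Stand-in for the action-conditioned predictor: in the real model a
    # is a 7D action and z a latent feature map; here the action simply
    # translates a 2D latent state.
    return z + a

def cem_plan(z0, z_goal, horizon=5, pop=64, n_elite=8, iters=10):
    mu = np.zeros((horizon, 2))
    sigma = np.ones((horizon, 2))
    for _ in range(iters):
        cand = rng.normal(mu, sigma, size=(pop, horizon, 2))
        costs = np.empty(pop)
        for i, seq in enumerate(cand):
            z = z0
            for a in seq:            # imagine the rollout in latent space
                z = predict_next(z, a)
            costs[i] = np.abs(z - z_goal).sum()  # L1 distance to the goal
        elite = cand[np.argsort(costs)[:n_elite]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-3  # refit
    return mu  # in practice only the first action is executed (MPC-style)

z0, z_goal = np.zeros(2), np.array([2.0, -1.0])
plan = cem_plan(z0, z_goal)
z = z0
for a in plan:
    z = predict_next(z, a)
final_gap = np.abs(z - z_goal).sum()
```

The cost of each candidate is one forward pass per step through the predictor in latent space; a pixel-generation planner pays for full video rendering at the same point in the loop, which is where the 16-second vs. 4-minute gap comes from.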
The most remarkable aspect is the training data requirement: just 62 hours of unlabeled robot video from the Droid dataset. No task labels. No reward signals. No success indicators. The action-conditioned predictor learns purely from observing how actions cause state changes, and this minimal training is sufficient for zero-shot deployment on physical Franka arms in two different labs, achieving 65–80% pick-and-place success rates.
How V-JEPA 2 Compares
Video Self-Supervised Method Comparison
How V-JEPA 2 compares to other self-supervised and video foundation models. SSv2 and K400 accuracy measure frozen feature quality on Something-Something v2 (temporal reasoning) and Kinetics-400 (action recognition).
| Method | Architecture | Data Scale | SSv2 (%) | K400 (%) | Key Advance |
|---|---|---|---|---|---|
| V-JEPA 2 | ViT-g (1B) | VideoMix22M (1M+ hrs) | 77.3 | ~83 | Mask denoising + world model |
| V-JEPA | ViT-L (300M) | VideoMix2M | 71.2 | 82.1 | Latent prediction for video |
| VideoMAEv2 | ViT-g (1B) | UnlabeledHybrid | 77.0* | 77.0 | Pixel reconstruction at scale (*fine-tuned) |
| I-JEPA | ViT-H (632M) | ImageNet-1K | — | — (image only) | Latent prediction for images |
| InternVideo2 | ViT-6B | Mixed (multi-modal) | 77.5* | 85.0* | Multi-modal + fine-tuned (*fine-tuned) |
| DINOv2 | ViT-g (1.1B) | LVD-142M | — (image only) | — | Universal frozen image features |
V-JEPA 2's insight
Mask denoising on 1M+ hours of video produces features that understand both appearance and motion — and can be extended to a world model for robotic planning with zero-shot deployment.
Trade-offs
Requires massive compute for ViT-g pre-training (~60 GPU-years at full resolution, reduced to ~7 with progressive training). Frozen SSv2 comparisons are complicated by methods that report fine-tuned numbers.
Key Results
| Benchmark | Metric | V-JEPA 2 (ViT-g) | V-JEPA (ViT-L) | Improvement |
|---|---|---|---|---|
| SSv2 | Frozen top-1 | 77.3% | 71.2% | +6.1 pts |
| K400 | Frozen top-1 | ~83% | 82.1% | +~1 pt |
| Epic-Kitchens-100 | R@5 anticipation | 39.7 | — | +44% relative |
| PerceptionTest | Video QA (8B) | 84.0% | — | — |
| TempCompass | Video QA (8B) | 76.9% | — | — |
| Droid (pick & place) | Zero-shot success | 65–80% | — | — |
Why V-JEPA 2 Matters
V-JEPA 2 demonstrates that self-supervised video models can bridge the gap from perception to action. Prior work in the JEPA family — I-JEPA, V-JEPA — focused on learning representations. V-JEPA 2 shows that the same latent space that understands video can also predict the consequences of actions and plan goal-directed behavior. This is the first concrete realization of Yann LeCun’s vision for a “world model” that learns physics and causality from observation alone, then uses that understanding to act.
The engineering implications are equally significant. V-JEPA 2-AC achieves useful robotic manipulation from 62 hours of unlabeled video — no reward engineering, no demonstration labeling, no simulation-to-real transfer. This dramatically lowers the barrier to deploying robots in new environments: record some video, post-train the world model, and plan. The progressive training strategy makes the compute cost tractable (7 GPU-years instead of 60), and the 3D-RoPE architecture ensures the model handles arbitrary video resolutions without retraining. Together, these advances make V-JEPA 2 not just a stronger video encoder but a foundation for embodied AI systems that learn from watching and act by imagining.
Key Takeaways
- Mask denoising is better than mask prediction — adding noise to masked patches instead of removing them eliminates the hard boundary between seen and unseen regions, producing smoother representations and more stable training at billion-parameter scale.
- Progressive training saves 8.4× compute — training at low resolution for 95% of iterations and scaling up during a brief cooldown transfers coarse understanding to fine detail, enabled by 3D-RoPE’s resolution-agnostic position encoding.
- Temporal reasoning emerges from scale — training on 1M+ hours of video produces features that genuinely understand motion and temporal causality, scoring 77.3% on SSv2 where single-frame models fail and achieving 44% relative improvement on action anticipation.
- Latent planning is 15× faster than pixel generation — V-JEPA 2-AC plans robot actions in 16 seconds by searching latent space, while pixel-generation approaches like Cosmos require 4+ minutes to render and evaluate candidate futures.
- 62 hours of unlabeled video enables zero-shot robotics — the action-conditioned world model learns to predict state transitions from minimal unlabeled data and deploys zero-shot on physical robots it has never seen, achieving 65–80% pick-and-place success without task labels or reward signals.
Related Reading
- V-JEPA — The direct predecessor: latent video prediction with ViT-L on VideoMix2M
- I-JEPA — Joint-embedding prediction for images, the architectural foundation
- DINOv2 — Universal frozen image features via DINO + iBOT at scale
- BEiT — Discrete token prediction that pioneered masked image modeling
- MAE — Pixel reconstruction for masked image modeling, the approach V-JEPA 2 outperforms
- DINO — Self-distillation with Vision Transformers and momentum teacher
- SimCLR — Contrastive learning framework for visual representations
- MoCo — Momentum contrast, architectural precursor to EMA-based methods
- BYOL — Non-contrastive learning without negative pairs
- VICReg — Variance-invariance-covariance regularization for SSL
