
I-JEPA: Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture

How I-JEPA learns visual representations by predicting abstract feature representations of masked image regions — no pixel reconstruction, no augmentation — achieving 81.7% linear probe accuracy with ViT-H.

Mahmoud Assran, Quentin Duval, et al. | 15 min read | Original Paper | self-supervised-learning · joint-embedding · predictive-architecture

Paper Overview

Self-supervised learning for vision has been dominated by two paradigms: invariance-based methods like DINO and BYOL that learn to produce identical representations for different augmented views of the same image, and reconstruction-based methods like MAE that learn to predict missing pixels from visible patches. Both paradigms come with fundamental limitations. Invariance-based methods require carefully engineered augmentation pipelines — and they can only learn invariances that the augmentations explicitly encode. Reconstruction-based methods waste model capacity predicting low-level details like exact textures, lighting gradients, and compression artifacts that carry no semantic meaning. I-JEPA offers a third path: predict abstract representations, not pixels, and do it without any hand-crafted augmentations.

I-JEPA — Image-based Joint-Embedding Predictive Architecture — learns visual representations by masking large blocks of an image and predicting their feature-level representations in a learned embedding space. A context encoder processes the visible patches, a predictor network maps context embeddings to predictions for the masked regions, and a momentum-updated target encoder provides the ground-truth representations that the predictor must match. Because the target encoder has already compressed the image into abstract features, the prediction task filters out irrelevant pixel-level noise and focuses the model on semantic content. No decoder, no pixel reconstruction, no augmentation pipeline — just predict what matters in latent space.

Published at CVPR 2023 by Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas at Meta AI, I-JEPA achieves striking results. With a ViT-H/14 backbone at resolution 448, I-JEPA reaches 81.7% top-1 linear probe accuracy on ImageNet-1K — the strongest linear evaluation result among methods that do not use hand-crafted augmentations. In the low-label regime, I-JEPA demonstrates exceptional label efficiency: with only 1% of ImageNet labels, it achieves 72.4% semi-supervised accuracy compared to 59.8% for MAE, a gap of over 12 percentage points. These results confirm that predicting abstract representations produces features that are more linearly separable and more semantically meaningful than those learned through pixel reconstruction.

Predict Features, Not Pixels

Masked image modeling methods like MAE train a decoder to reconstruct the exact RGB values of masked patches. This pixel-level reconstruction target treats all visual information as equally important — the model is penalized just as heavily for mispredicting the precise shade of a background wall as for failing to capture the shape of a person standing in front of it. The consequence is that a significant fraction of the model’s capacity is spent encoding high-frequency texture details, lighting variations, and noise patterns that are irrelevant for downstream tasks like classification, detection, and segmentation. The reconstruction loss does not distinguish between what is semantically meaningful and what is perceptually irrelevant.

I-JEPA sidesteps this problem by predicting in a learned embedding space rather than in pixel space. The predictor takes the context encoder’s output for visible patches and produces predicted embeddings for the masked positions. These predictions are compared against the embeddings produced by a separate target encoder that processes the full image. The loss is the L2 distance between predicted and target embeddings, averaged over all masked positions and target blocks:

$$\mathcal{L} = \sum_{i=1}^{M} \left\lVert\, s_\theta(z_{\text{ctx}}, m_i) - \text{sg}(\bar{z}_{y_i}) \,\right\rVert_2^2$$

where s_θ is the predictor network, z_ctx is the context encoder’s output for visible patches, m_i are positional mask tokens indicating where to predict, z̄_{y_i} is the target encoder’s output for the i-th target block, sg denotes stop-gradient (no gradients flow through the target encoder), and M is the number of target blocks. Because the target encoder has already abstracted away irrelevant low-level details, the prediction target is inherently more semantic. The model learns to predict object shapes, spatial relationships, and scene structure — not pixel noise.
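As a concrete illustration, the loss reduces to a mean-squared error between predicted and target embeddings gathered at the masked positions. The following PyTorch sketch assumes those tensors have already been collected; the function name and tensor shapes are illustrative, not the official implementation.

```python
import torch.nn.functional as F

def ijepa_loss(predicted, target):
    """L2 loss between predicted and target embeddings (illustrative sketch).

    predicted: (B, M, K, D) predictor outputs for M target blocks of K patches each
    target:    (B, M, K, D) target-encoder embeddings at the same positions
    """
    # Stop-gradient: the target branch receives no gradients.
    target = target.detach()
    # Squared L2 distance, averaged over masked positions and target blocks.
    return F.mse_loss(predicted, target)
```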

Multi-Block Masking Strategy

The masking strategy is one of I-JEPA’s most carefully designed components, and it differs fundamentally from the random patch masking used in MAE. I-JEPA generates four large target blocks, each covering approximately 15% of the image area (with scale ranging from 0.15 to 0.2 and aspect ratios between 0.75 and 1.5). These target blocks are the regions that the model must predict. The context — the patches visible to the encoder — is everything outside the union of these target blocks. A single large context block is sampled with scale between 0.85 and 1.0, and the intersection of this context block with the complement of the target blocks defines the visible region.
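A minimal sketch of this sampling procedure over a 14×14 patch grid appears below; the helper names and the exact aspect-ratio convention are assumptions for illustration, not the official implementation.

```python
import math
import random

def sample_block(grid_h, grid_w, scale_range, aspect_range):
    """Sample one rectangular block of patch coordinates (illustrative sketch)."""
    scale = random.uniform(*scale_range)       # fraction of the image area to cover
    aspect = random.uniform(*aspect_range)     # assumed height/width convention
    area = scale * grid_h * grid_w
    h = min(grid_h, max(1, round(math.sqrt(area * aspect))))
    w = min(grid_w, max(1, round(math.sqrt(area / aspect))))
    top, left = random.randint(0, grid_h - h), random.randint(0, grid_w - w)
    return {(r, c) for r in range(top, top + h) for c in range(left, left + w)}

def sample_ijepa_masks(grid_h=14, grid_w=14, num_targets=4):
    """Four target blocks (scale 0.15-0.2) plus a large context block minus the targets."""
    targets = [sample_block(grid_h, grid_w, (0.15, 0.2), (0.75, 1.5))
               for _ in range(num_targets)]
    context = sample_block(grid_h, grid_w, (0.85, 1.0), (1.0, 1.0))
    visible = context - set().union(*targets)  # context patches the encoder actually sees
    return visible, targets
```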

The choice of large, contiguous target blocks is deliberate and essential. When masked regions are small or randomly scattered (as in MAE’s random 75% masking), the model can reconstruct missing patches by interpolating from nearby visible patches — a local, low-level operation that does not require understanding the image’s semantic content. Large blocks force the model to reason about what should exist in entire regions of the image. To predict the features of a 15% block covering a dog’s head, the model cannot rely on interpolating adjacent textures; it must understand that there is a dog in the image, reason about its pose and spatial extent, and predict the appropriate abstract features. This holistic reasoning is precisely the kind of understanding that transfers to downstream tasks. Ablations confirm that large target blocks significantly outperform small or random masking strategies for representation quality.

The I-JEPA Pipeline

The I-JEPA architecture consists of three components: a context encoder, a predictor, and a target encoder. The input image is first divided into non-overlapping patches (e.g., 16×16 or 14×14 pixels). The multi-block masking strategy determines which patches are visible (context) and which are masked (targets). Only the visible patches are fed to the context encoder — a standard Vision Transformer (ViT-L or ViT-H) — which processes them with full self-attention and produces context embeddings. Because masked patches are completely excluded from the encoder’s input, the encoder never sees the target regions and must encode enough information in the visible patches to support downstream prediction.
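As a sketch of this input path, assuming a standard ViT patchification (the helper `select_visible` and the index tensor `visible_idx` are hypothetical names, not from the paper’s code):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and linearly embed them (illustrative)."""
    def __init__(self, patch_size=14, in_chans=3, dim=1280):
        super().__init__()
        # A strided convolution patchifies and projects in one step.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, imgs):                    # imgs: (B, 3, H, W)
        x = self.proj(imgs)                     # (B, dim, H/p, W/p)
        return x.flatten(2).transpose(1, 2)     # (B, num_patches, dim)

def select_visible(tokens, visible_idx):
    """Keep only the context patches; masked positions never reach the encoder."""
    idx = visible_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))  # (B, N_ctx, dim)
    return torch.gather(tokens, 1, idx)
```

The gathered context tokens are then processed by the ViT blocks of the context encoder with full self-attention among the visible patches only.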

The predictor is a relatively narrow transformer (typically 12 layers with embedding dimension 384) that takes the context encoder’s output embeddings along with positional mask tokens indicating the target positions. For each target block, the predictor produces predicted embeddings at every masked spatial position within that block. The predictor’s narrow architecture is intentional: a predictor with excessive capacity could learn the prediction mapping without requiring the context encoder to produce informative representations. The bottleneck forces the representational burden onto the encoder, which is the component whose features are ultimately used for downstream tasks.
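A minimal PyTorch sketch of such a predictor is shown below. It reuses `nn.TransformerEncoder` in place of the paper’s ViT-style blocks, and the class and argument names (`IJEPAPredictor`, `ctx_pos`, `tgt_pos`) are illustrative assumptions rather than the official implementation.

```python
import torch
import torch.nn as nn

class IJEPAPredictor(nn.Module):
    """Narrow transformer mapping context embeddings + mask tokens to target predictions."""
    def __init__(self, encoder_dim=1280, pred_dim=384, depth=12, heads=12):
        super().__init__()
        self.proj_in = nn.Linear(encoder_dim, pred_dim)       # down-project to the narrow width
        self.mask_token = nn.Parameter(torch.zeros(1, 1, pred_dim))
        layer = nn.TransformerEncoderLayer(pred_dim, heads,
                                           dim_feedforward=4 * pred_dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj_out = nn.Linear(pred_dim, encoder_dim)      # back up to the encoder width

    def forward(self, ctx_emb, ctx_pos, tgt_pos):
        # ctx_emb: (B, N_ctx, encoder_dim) context-encoder outputs for visible patches
        # ctx_pos / tgt_pos: positional embeddings for visible / masked positions (width pred_dim)
        ctx = self.proj_in(ctx_emb) + ctx_pos
        # One shared mask token per target position, distinguished only by its positional embedding.
        queries = self.mask_token.expand(tgt_pos.size(0), tgt_pos.size(1), -1) + tgt_pos
        x = self.blocks(torch.cat([ctx, queries], dim=1))
        # Keep only the outputs at the masked positions; these are compared to the target encoder.
        return self.proj_out(x[:, ctx.size(1):])
```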

The target encoder shares the same architecture as the context encoder but is updated via exponential moving average (EMA) rather than gradient descent. After each training step, the target encoder’s parameters θ̄ are updated as θ̄ ← τ · θ̄ + (1 − τ) · θ, where θ are the context encoder’s parameters and τ follows a cosine schedule from 0.996 to 1.0. The target encoder processes the full image — all patches including those masked from the context encoder — and produces the ground-truth embeddings against which the predictor’s outputs are compared.

Stop-gradient is applied to the target encoder’s outputs, meaning no gradients flow back through the target branch. This EMA-plus-stop-gradient design prevents representation collapse: the target evolves slowly enough to provide a stable prediction objective, while the asymmetry between the trained context encoder and the momentum-updated target encoder breaks the symmetry that would otherwise allow both networks to converge to a trivial constant output. Crucially, I-JEPA requires no hand-crafted data augmentations — the masking itself is the only source of training signal, making the method more domain-agnostic than augmentation-dependent alternatives like DINO or BYOL.
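A minimal sketch of the momentum update, assuming `target_encoder` and `context_encoder` are architecturally identical modules whose parameters iterate in the same order:

```python
import torch

@torch.no_grad()
def ema_update(target_encoder, context_encoder, tau):
    """Momentum update of the target encoder (illustrative sketch); tau anneals from 0.996 toward 1.0."""
    for tgt_p, ctx_p in zip(target_encoder.parameters(), context_encoder.parameters()):
        # theta_bar <- tau * theta_bar + (1 - tau) * theta
        tgt_p.mul_(tau).add_(ctx_p, alpha=1.0 - tau)
```

Called once after each optimizer step, this keeps the target encoder a slowly trailing copy of the context encoder without ever backpropagating through it.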

Semantic vs Texture Features

A striking qualitative difference emerges when comparing the attention maps learned by I-JEPA and MAE. When visualizing where each model’s encoder attends in an image, I-JEPA’s attention concentrates on semantically meaningful regions — objects, object parts, and their boundaries. Given an image of a dog sitting on grass, I-JEPA’s attention heads consistently focus on the dog’s body, head, and limbs, with relatively little attention paid to the background. This behavior emerges naturally from the training objective: predicting abstract representations of large masked blocks requires understanding what objects are present and where they are located, so the encoder learns to attend to semantically informative regions.

MAE’s attention maps tell a different story. Because MAE reconstructs pixels, its encoder must capture enough information to reproduce exact textures, edges, and color gradients in the masked regions. The attention maps consequently spread across the image more uniformly, focusing heavily on textural boundaries, repetitive patterns, and high-frequency details — information that is essential for pixel reconstruction but largely irrelevant for semantic understanding. This difference in learned attention patterns directly explains I-JEPA’s superior performance on classification and its dramatically better label efficiency: features that attend to objects rather than textures are inherently more linearly separable and require less labeled data to map to semantic categories. The attention difference is not a minor qualitative observation — it is the visible manifestation of a fundamental difference in what the two objectives train the model to represent.

Label Efficiency

One of I-JEPA’s most practically significant results is its exceptional performance in the low-label regime. When only 1% of ImageNet labels are available (approximately 12,800 labeled images out of 1.28 million), I-JEPA achieves 72.4% semi-supervised top-1 accuracy. Under the same conditions, MAE achieves only 59.8% — a gap of 12.6 percentage points. This is not a marginal improvement; it represents a qualitative difference in the usefulness of the learned features. I-JEPA’s representations are sufficiently structured that a simple linear classifier can separate semantic categories even with minimal labeled examples, while MAE’s texture-heavy features require substantially more supervision to achieve comparable separation.

The label efficiency advantage extends beyond the 1% regime. At 10% labels, I-JEPA continues to outperform MAE and other reconstruction-based methods by meaningful margins. This pattern is consistent with the qualitative difference in attention maps: features that focus on objects and semantic structure are inherently more aligned with classification labels than features that emphasize textures and local patterns. For practitioners, this has direct implications — labeled data is expensive to acquire, and a self-supervised method that requires 10× fewer labels to reach a given accuracy level can dramatically reduce annotation costs. I-JEPA’s label efficiency makes it particularly attractive for domains where labeled data is scarce, such as medical imaging, satellite imagery, and industrial inspection, where collecting large labeled datasets is prohibitively expensive.
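To make the linear-separability claim concrete, a linear probe simply fits a linear classifier on frozen encoder features. The paper’s exact evaluation recipes differ in details such as feature pooling and regularization, but the idea can be sketched as follows (feature arrays are assumed to be pre-extracted):

```python
from sklearn.linear_model import LogisticRegression

def linear_probe_accuracy(train_feats, train_labels, test_feats, test_labels):
    """Fit a linear classifier on frozen features and report top-1 accuracy (illustrative sketch).

    train_feats / test_feats: (N, D) arrays of encoder embeddings, e.g. average-pooled
    patch features; with 1% of ImageNet labels, N is only ~12,800.
    """
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)
```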

How I-JEPA Compares

Self-Supervised Method Comparison

How I-JEPA compares to other self-supervised and supervised approaches on ImageNet. I-JEPA achieves the strongest linear probe accuracy with ViT-H and dominates in low-label regimes, all without augmentation engineering.

| Method | Approach | Augmentation | Linear Probe | 1% Labels | Notes |
|---|---|---|---|---|---|
| I-JEPA | Feature prediction | None | 81.7% | 72.4% | Abstract representation prediction without any hand-crafted augmentations |
| MAE | Pixel reconstruction | Random crop + flip | 75.8% | 59.8% | Simple pixel target, 3.5× faster training |
| DINO | Self-distillation | Multi-crop + color | 78.2% | 68.5% | Emergent segmentation in attention maps |
| data2vec | Feature prediction | Crop + mask | 79.8% | 64.2% | Multi-modal framework (text, speech, vision) |
| iBOT | Distillation + MIM | Multi-crop + mask | 79.5% | — | Combines DINO image-level + BEiT patch-level |
| Supervised | Cross-entropy | Standard | — | 25.4% | Requires labeled data, saturates at scale |

I-JEPA's key insights
  • Abstract prediction in representation space forces semantic understanding
  • No hand-crafted augmentations: masking alone provides a sufficient learning signal
  • Multi-block targets encourage both local and global feature learning
Trade-offs
  • Requires an EMA target encoder, adding memory and compute overhead
  • Fine-tuning accuracy slightly below MAE in the full-label regime
  • Predictor architecture and masking strategy require careful tuning

Key Results

| Method | Backbone | Linear Probe (IN-1K) | 1% Labels | Notes |
|---|---|---|---|---|
| I-JEPA | ViT-H/14, 448px | 81.7% | 72.4% | No augmentations |
| I-JEPA | ViT-H/16, 224px | 80.3% | 71.5% | Standard resolution |
| MAE | ViT-H/14 | 77.2% | 59.8% | Pixel reconstruction |
| DINO | ViT-B/16 | 78.2% | — | Augmentation-dependent |
| data2vec | ViT-L/16 | 79.2% | — | Multi-modal framework |

I-JEPA’s 81.7% linear probe result is particularly noteworthy because it is achieved without any hand-crafted data augmentations — the model sees only masked versions of natural images during pre-training. Methods like DINO that rely on multi-crop augmentation strategies and color jitter achieve strong linear probe numbers, but their performance is tied to the specific augmentation pipeline. I-JEPA’s augmentation-free design means its representations are not biased toward invariances that happen to be useful for ImageNet but may not transfer to other domains.

Why I-JEPA Matters

I-JEPA is a concrete realization of Yann LeCun’s vision for a new paradigm in self-supervised learning — one based on prediction in abstract representation spaces rather than pixel-level reconstruction or augmentation-driven invariance. LeCun has argued that the next generation of AI systems should learn world models by predicting abstract representations of sensory inputs, filtering out unpredictable low-level details while capturing the essential structure of the world. I-JEPA demonstrates that this vision is not merely theoretical: predicting abstract features of masked image regions produces representations that are more semantically meaningful, more linearly separable, and more label-efficient than those learned by pixel reconstruction, and it does so without requiring the carefully engineered augmentation pipelines that invariance-based methods depend on.

The broader significance extends beyond the specific numbers. I-JEPA establishes the joint-embedding predictive architecture as a viable and compelling framework for self-supervised learning. Its success on images laid the groundwork for V-JEPA’s extension to video, and the architecture’s modality-agnostic design — context encoder, predictor, EMA target encoder, no augmentations — suggests a path toward unified self-supervised learning across modalities. By demonstrating that you can learn strong visual features by simply masking patches and predicting their abstract representations, I-JEPA reduces the engineering complexity of self-supervised learning while simultaneously improving the quality of the resulting representations. The method is simpler than contrastive learning (no negatives, no augmentation pipeline), simpler than masked autoencoders (no decoder, no pixel reconstruction), and produces better features for downstream tasks — a rare combination that signals a genuine advance in the field.

Key Takeaways

  1. Predicting abstract representations outperforms predicting pixels — I-JEPA’s 81.7% linear probe accuracy with ViT-H surpasses MAE’s pixel-reconstruction approach, and the 12.6-point gap at 1% labels (72.4% vs 59.8%) demonstrates that latent prediction produces fundamentally more useful features.

  2. Large target blocks force semantic reasoning — masking four blocks of approximately 15% each creates prediction tasks that cannot be solved by local interpolation, forcing the model to develop holistic understanding of objects and scenes rather than memorizing texture patterns.

  3. No hand-crafted augmentations are needed — unlike DINO, BYOL, and SimCLR, I-JEPA uses only masking as its self-supervised signal, making the method more domain-agnostic and eliminating sensitivity to augmentation hyperparameters.

  4. The predictor bottleneck is essential — a narrow predictor (12 layers, width 384) forces the context encoder to carry the representational burden, ensuring that the encoder’s features — which are used downstream — are maximally informative.

  5. I-JEPA attention maps focus on objects, not textures — the qualitative difference in learned attention between I-JEPA (semantic regions) and MAE (textural details) directly explains I-JEPA’s superior label efficiency and transfer performance.

Related Concepts

  • MAE — Masked autoencoders that reconstruct pixels, the primary pixel-reconstruction baseline I-JEPA outperforms
  • V-JEPA — Extension of the joint-embedding predictive architecture from images to video
  • DINO — Self-distillation with Vision Transformers using augmentation-driven invariance
  • BEiT — BERT-style pre-training for images using discrete visual tokens
  • BYOL — Bootstrap Your Own Latent, non-contrastive learning with predictor and EMA target
  • SimCLR — Contrastive framework for visual representations with augmentation-dependent learning
  • MoCo — Momentum contrast with a queue of negatives for self-supervised learning

If you found this paper review helpful, consider sharing it with others.
