Matryoshka Embeddings

Summary
Matryoshka embeddings are nested representations: a vector can be reduced in dimension by simple truncation, without retraining the model, which makes retrieval cost tunable.

Matryoshka embeddings enable flexible dimension reduction through nested representations: train once, then deploy at any dimension by simply truncating the vector — no re-embedding, no second model.

Truncate and retrieve

The interactive demo on the original page lets you move a dimension cutoff and see how much of a word’s nearest-neighbor ranking survives as dimensions are discarded. The vectors are real (GloVe, ordered by variance), and the similarity, fidelity, and recall numbers are all computed live.

The key observation: because information is concentrated in the early dimensions, a small prefix already reproduces most of the full-dimension ranking — so you can pick the dimension that fits your latency and memory budget at query time.
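One way to check this observation offline is to compare the nearest-neighbor list at a truncated dimension with the list at full dimension. The sketch below is a minimal NumPy illustration under stated assumptions: the vectors are synthetic Gaussian data whose per-dimension scale decays with the index (a stand-in for variance-ordered vectors, not GloVe and not an MRL-trained model), and the helper names `unit`, `topk_neighbors`, and `recall_at_k` are hypothetical.

```python
import numpy as np

def unit(x):
    """L2-normalize along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def topk_neighbors(vectors, query_idx, k):
    """Indices of the k nearest rows (by cosine) to row `query_idx`, excluding itself."""
    v = unit(vectors)
    scores = v @ v[query_idx]
    scores[query_idx] = -np.inf          # never return the query row itself
    return np.argsort(-scores)[:k]

def recall_at_k(vectors, query_idx, m, k=10):
    """Fraction of the full-dimension top-k that survives truncation to m dims."""
    full = set(topk_neighbors(vectors, query_idx, k).tolist())
    trunc = set(topk_neighbors(vectors[:, :m], query_idx, k).tolist())
    return len(full & trunc) / k

# Synthetic stand-in (assumption): early dimensions carry more variance.
rng = np.random.default_rng(0)
n, d = 2000, 256
scales = 1.0 / np.sqrt(np.arange(1, d + 1))
vectors = rng.normal(size=(n, d)) * scales

# Average recall over the first 20 rows, at several cutoffs.
recalls = {
    m: float(np.mean([recall_at_k(vectors, i, m) for i in range(20)]))
    for m in (8, 32, 64, 128, 256)
}
for m, r in recalls.items():
    print(f"first {m:>3} dims: recall@10 vs full = {r:.2f}")
```

To run the same check on real embeddings, replace `vectors` with your own `[n, d]` matrix. The recall values depend entirely on the data and the model, so the synthetic numbers say nothing about any particular encoder.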

The Matryoshka principle

Like Russian nesting dolls, a Matryoshka embedding contains usable representations at many scales inside one vector:

768D [████████████████████████] 100% full
512D [████████████████] 99% production
256D [██████████] 98% balanced
128D [█████] 95% mobile / real-time
64D [██] 89% edge devices
32D [█] 82% extreme constraints

Each prefix is a complete embedding — not a lossy compression you have to decode, just the first m numbers of the same vector.

Train once, truncate anywhere

A normal encoder is trained for one output size; halving it means training a second model. A Matryoshka model is trained so every prefix is independently useful, so truncation is the dimensionality reduction:

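    from sentence_transformers import SentenceTransformer
    import torch.nn.functional as F

    # MRL-trained; this checkpoint ships custom model code, hence trust_remote_code
    model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

    # Example input; this model expects a task prefix such as "search_document: "
    texts = ["search_document: Matryoshka embeddings nest small vectors inside large ones."]
    full = model.encode(texts, convert_to_tensor=True)  # [n, 768]
    # The model card's example applies layer norm before truncating
    full = F.layer_norm(full, normalized_shape=(full.shape[1],))

    # Use any dimension: just slice, then re-normalize for cosine
    emb_128 = F.normalize(full[:, :128], p=2, dim=-1)  # 6× smaller, ~95% quality
    emb_64 = F.normalize(full[:, :64], p=2, dim=-1)    # 12× smaller, ~89% quality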

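Re-normalizing after truncation matters: the prefix of a unit vector generally has norm below 1, so a plain dot product (which is how many vector indexes compute cosine similarity) no longer equals the cosine until the prefix is rescaled to unit length.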
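The effect is easy to see numerically. The sketch below is a small NumPy check, assuming two random unit vectors in place of real embeddings: it truncates both to 64 dimensions and compares the raw dot product with the true cosine of the prefixes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random 768-dim vectors, scaled to unit length (stand-ins for embeddings).
a, b = rng.normal(size=(2, 768))
a = a / np.linalg.norm(a)
b = b / np.linalg.norm(b)

# Truncate to the first 64 values without re-normalizing.
a64, b64 = a[:64], b[:64]
prefix_norm = float(np.linalg.norm(a64))

raw_dot = float(a64 @ b64)                                   # what a dot-product index would return
cosine = raw_dot / (prefix_norm * float(np.linalg.norm(b64)))  # true cosine of the prefixes

print(f"norm of 64-dim prefix: {prefix_norm:.3f}")
print(f"raw dot product:       {raw_dot:.4f}")
print(f"cosine after rescale:  {cosine:.4f}")
```

Because prefix norms differ from vector to vector, skipping the rescale does more than shrink every score by a constant: it can also change the ranking, favoring documents whose prefixes happen to be longer.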

How MRL trains it

Matryoshka Representation Learning sums the contrastive loss across a set of nested dimensions, so gradients push useful structure into the early dimensions:

$$\mathcal{L}_{\text{MRL}} = \sum_{m \in M} \lambda_m \cdot \mathcal{L}_{\text{contrastive}}(E[:m])$$

where $M = \{d_1, \dots, d_k\}$ is the set of dimensions, $E[:m]$ is the first $m$ dimensions of the embedding, and $\lambda_m$ weights each scale.

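    import torch.nn.functional as F

    def matryoshka_loss(embeddings, labels, dims=(768, 512, 256, 128, 64), weights=None):
        """Weighted contrastive loss over nested prefixes of `embeddings` ([batch, 768])."""
        # `info_nce` is not defined in this article: any contrastive loss that takes
        # (normalized embeddings, labels) and returns a scalar fits here.
        weights = [1.0] * len(dims) if weights is None else weights
        loss = 0.0
        for dim, w in zip(dims, weights):
            truncated = F.normalize(embeddings[:, :dim], p=2, dim=-1)  # unit-length prefix
            loss += w * info_nce(truncated, labels)  # contrastive loss at this scale
        # Dividing by the total weight turns the weighted sum into a weighted mean
        return loss / sum(weights)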

Dimension vs accuracy

The trade-off MRL buys you — pick a row to match your constraints:

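| Dimension | Relative size | Accuracy | Speed | Use case |
|---|---|---|---|---|
| 768 | 100% | 100% | | Research, high-quality |
| 512 | 67% | 99.2% | 1.5× | Production servers |
| 256 | 33% | 97.5% | | Balanced performance |
| 128 | 17% | 94.8% | | Mobile, real-time |
| 64 | 8% | 89.3% | 12× | Edge devices |
| 32 | 4% | 82.1% | 24× | IoT, extreme constraints |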
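The relative-size column follows directly from the dimension count, and it translates into index memory in the same proportion. As a rough sketch, assuming uncompressed float32 storage and a corpus of one million vectors (both assumptions, as is the helper name `index_size_bytes`):

```python
FULL_DIM = 768
DIMS = (768, 512, 256, 128, 64, 32)

def index_size_bytes(n_vectors, dim, bytes_per_value=4):
    """Raw storage for n_vectors of `dim` values each (float32 by default)."""
    return n_vectors * dim * bytes_per_value

# Relative size of each prefix, as a rounded percentage of the full vector.
relative_pct = {dim: round(100 * dim / FULL_DIM) for dim in DIMS}

for dim in DIMS:
    gib = index_size_bytes(1_000_000, dim) / 2**30
    print(f"{dim:>4}D  {relative_pct[dim]:>3}% of full size  {gib:.3f} GiB per million vectors")
```

This counts only the raw vectors. Any index structure built on top (graph links, cluster centroids) adds its own overhead that does not shrink with the dimension.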

In practice

A common pattern is cascaded (funnel) retrieval: filter the whole corpus cheaply at a small dimension, then rerank the survivors at full dimension, all from the same stored vectors.

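    import numpy as np

    def cascaded_search(query, docs, encode, k=10, n_candidates=1000):
        """docs: [n, d] full vectors, stored once; encode: text -> [d] vector."""
        unit = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
        q = encode(query)  # one full vector
        # Stage 1: cheap 32-dim scan over everything (prefixes re-normalized)
        cand = np.argsort(-(unit(docs[:, :32]) @ unit(q[:32])))[:n_candidates]
        # Stage 2: rerank the survivors at full dimension
        best = np.argsort(-(unit(docs[cand]) @ unit(q)))[:k]
        return cand[best]  # map candidate positions back to indices into docs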
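The funnel need not stop at two stages, and it is worth checking how often it agrees with an exhaustive full-dimension scan. The sketch below is a minimal NumPy version under stated assumptions: the corpus is synthetic (Gaussian vectors with decaying per-dimension scale, not output from a real encoder), the stage sizes are arbitrary, and `funnel_search` and `exhaustive_search` are hypothetical helper names.

```python
import numpy as np

def unit(x):
    """L2-normalize along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def funnel_search(q, docs, stages=((32, 1000), (128, 100)), k=10):
    """Return indices into `docs` of the top-k matches for query vector `q`.

    Each stage is (dim, keep): score the current candidates on their first
    `dim` values, keep the best `keep`, then finish with a full-dimension rerank.
    """
    cand = np.arange(docs.shape[0])
    for dim, keep in stages:
        scores = unit(docs[cand, :dim]) @ unit(q[:dim])
        cand = cand[np.argsort(-scores)[:keep]]
    scores = unit(docs[cand]) @ unit(q)
    return cand[np.argsort(-scores)[:k]]

def exhaustive_search(q, docs, k=10):
    """Brute-force cosine top-k at full dimension, for comparison."""
    scores = unit(docs) @ unit(q)
    return np.argsort(-scores)[:k]

# Synthetic corpus (assumption): early dimensions carry more variance.
rng = np.random.default_rng(0)
n, d = 5000, 256
scales = 1.0 / np.sqrt(np.arange(1, d + 1))
docs = rng.normal(size=(n, d)) * scales
q = docs[0] + 0.1 * rng.normal(size=d) * scales   # a query close to document 0

funnel = funnel_search(q, docs)
exact = exhaustive_search(q, docs)
overlap = len(set(funnel.tolist()) & set(exact.tolist())) / len(exact)
print(f"top-1 from funnel: {int(funnel[0])}, overlap with exhaustive top-10: {overlap:.2f}")
```

The `keep` sizes trade recall against cost: a candidate dropped at a coarse stage cannot be recovered later, so the first stage is usually kept generous.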

Two more rules of thumb: weight smaller dimensions at least as heavily as larger ones during training (they have the hardest job), and store the full vector once — never re-embed to change dimension.

References

  • Kusupati et al., "Matryoshka Representation Learning" (2022): the original MRL paper.
  • Nussbaum et al., "Nomic Embed: Training a Reproducible Long Context Text Embedder" (2024): a widely used MRL-trained model.
  • OpenAI text-embedding-3: production embeddings with native dimension truncation.
