Matryoshka embeddings enable flexible dimension reduction through nested representations: train once, then deploy at any dimension by simply truncating the vector — no re-embedding, no second model.
Try it: truncate and retrieve
The interactive demo for this section lets you drag a dimension cutoff and watch how much of a word’s nearest-neighbor ranking survives as dimensions are discarded. Its vectors are real GloVe embeddings with dimensions reordered by variance, and the similarity, fidelity, and recall numbers are all computed live.
The key observation: because information is concentrated in the early dimensions, a small prefix already reproduces most of the full-dimension ranking — so you can pick the dimension that fits your latency and memory budget at query time.
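The same measurement can be sketched offline. The snippet below is a minimal illustration, not the demo's actual code: it uses synthetic Gaussian vectors whose per-dimension scale decays (an assumption standing in for real embeddings), and the helper names `unit`, `top_k_neighbors`, and `recall_at_k` are made up for this example. It reports how much of the full-dimension top-10 a prefix recovers.

```python
import numpy as np

def unit(x):
    """L2-normalize along the last axis so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def top_k_neighbors(vectors, query_idx, k):
    """Indices of the k nearest neighbors of one row, excluding the row itself."""
    sims = unit(vectors) @ unit(vectors[query_idx])
    sims[query_idx] = -np.inf  # never return the query itself
    return np.argsort(-sims)[:k]

def recall_at_k(vectors, dim, query_idx=0, k=10):
    """Fraction of the full-dimension top-k that the dim-sized prefix recovers."""
    full = set(top_k_neighbors(vectors, query_idx, k).tolist())
    prefix = set(top_k_neighbors(vectors[:, :dim], query_idx, k).tolist())
    return len(full & prefix) / k

# Synthetic stand-in for real embeddings: the per-dimension scale decays,
# so early dimensions carry most of the variance (an assumption of this sketch).
rng = np.random.default_rng(0)
n, d = 2000, 256
scales = 1.0 / np.sqrt(np.arange(1, d + 1))
vectors = rng.standard_normal((n, d)) * scales

for dim in (256, 128, 64, 32, 16):
    print(f"{dim:>3}D  recall@10 = {recall_at_k(vectors, dim):.2f}")
```

On real MRL-trained embeddings the curve will differ; the point is only that recall against the full-dimension ranking is the quantity to watch when choosing a cutoff.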
The Matryoshka principle
Like Russian nesting dolls, a Matryoshka embedding contains usable representations at many scales inside one vector:
```
768D [████████████████████████] 100%  full
512D [████████████████]          99%  production
256D [██████████]                98%  balanced
128D [█████]                     95%  mobile / real-time
 64D [██]                        89%  edge devices
 32D [█]                         82%  extreme constraints
```
Each prefix is a complete embedding — not a lossy compression you have to decode, just the first m numbers of the same vector.
Train once, truncate anywhere
A normal encoder is trained for one output size; halving it means training a second model. A Matryoshka model is trained so every prefix is independently useful, so truncation is the dimensionality reduction:
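```python
from sentence_transformers import SentenceTransformer
import torch.nn.functional as F

# MRL-trained model; its custom architecture requires trust_remote_code=True
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

# Example input; this model expects a task prefix such as "search_document: "
texts = ["search_document: Matryoshka embeddings nest small vectors inside large ones."]
full = model.encode(texts, convert_to_tensor=True)  # [n, 768]

# Use any dimension: just slice, then re-normalize for cosine
emb_128 = F.normalize(full[:, :128], p=2, dim=-1)  # 6× smaller, ~95% quality
emb_64 = F.normalize(full[:, :64], p=2, dim=-1)    # 12× smaller, ~89% quality
```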
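Re-normalizing after truncation matters whenever similarity is computed as a plain dot product, as most vector indexes do: the dot product equals cosine similarity only for unit vectors, and a prefix of a normalized vector is not itself normalized.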
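A small numeric check makes the point concrete. This is a sketch with random vectors rather than real embeddings; it shows that a 64-dimension prefix of a unit vector has norm below 1, and that the raw dot product of two prefixes only matches their cosine similarity after re-normalizing.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(768)
b = rng.standard_normal(768)
a /= np.linalg.norm(a)  # full vectors are unit length
b /= np.linalg.norm(b)

a64, b64 = a[:64], b[:64]
prefix_norm = np.linalg.norm(a64)  # below 1: the prefix holds only part of the norm

raw_dot = a64 @ b64  # not a cosine, because both prefixes are shorter than unit length
cosine = raw_dot / (np.linalg.norm(a64) * np.linalg.norm(b64))

a64n = a64 / np.linalg.norm(a64)  # re-normalized prefixes
b64n = b64 / np.linalg.norm(b64)
print(prefix_norm, raw_dot, a64n @ b64n)
```

Because each document's prefix has a different norm, skipping this step does not just rescale scores; it can also reorder the results.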
How MRL trains it
Matryoshka Representation Learning sums the contrastive loss across a set of nested dimensions, so gradients push useful structure into the early dimensions:
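$$\mathcal{L}_{\text{MRL}} = \sum_{m \in M} \lambda_m \, \mathcal{L}_{\text{contrastive}}\big(E[:m]\big)$$

where $M = \{d_1, \dots, d_k\}$ is the set of dimensions, $E[:m]$ is the first $m$ dimensions of the embedding, and $\lambda_m$ weights each scale. The implementation below additionally divides by $\sum_{m \in M} \lambda_m$, which only rescales the loss.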
```python
import torch.nn.functional as F

def matryoshka_loss(embeddings, labels, dims=(768, 512, 256, 128, 64), weights=None):
    # info_nce(embeddings, labels) is the contrastive loss, assumed to be defined elsewhere
    weights = weights or [1.0] * len(dims)
    loss = 0.0
    for dim, w in zip(dims, weights):
        truncated = F.normalize(embeddings[:, :dim], p=2, dim=-1)
        loss += w * info_nce(truncated, labels)  # contrastive loss at this scale
    return loss / sum(weights)
```
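To see the arithmetic end to end, here is a self-contained forward-pass sketch in NumPy. It is an illustration under assumptions, not a training recipe: it has no autograd, it takes paired `(queries, positives)` batches instead of the `labels` argument used above, and the in-batch `info_nce` shown here (with an assumed temperature of 0.05) is one common choice of contrastive loss rather than a prescribed one.

```python
import numpy as np

def unit(x):
    """L2-normalize along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(queries, positives, temperature=0.05):
    """In-batch InfoNCE: row i of `positives` is the match for row i of `queries`."""
    logits = (unit(queries) @ unit(positives).T) / temperature  # [n, n]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def matryoshka_loss(queries, positives, dims=(768, 512, 256, 128, 64), weights=None):
    """Weighted average of the contrastive loss over nested prefixes."""
    weights = [1.0] * len(dims) if weights is None else list(weights)
    total = sum(
        w * info_nce(queries[:, :m], positives[:, :m])
        for m, w in zip(dims, weights)
    )
    return total / sum(weights)

rng = np.random.default_rng(0)
queries = rng.standard_normal((8, 768))
positives = queries + 0.1 * rng.standard_normal((8, 768))  # noisy copies as positives
unrelated = rng.standard_normal((8, 768))                  # no real matches

print(matryoshka_loss(queries, positives))  # low: every prefix finds its match
print(matryoshka_loss(queries, unrelated))  # higher: nothing to match at any scale
```

In training, the same structure would be written in an autograd framework so that every prefix contributes gradients to the shared early dimensions.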
Dimension vs accuracy
The trade-off MRL buys you — pick a row to match your constraints:
| Dimension | Relative size | Relative accuracy | Speedup | Use case |
|---|---|---|---|---|
| 768 | 100% | 100% | 1× | Research, high-quality |
| 512 | 67% | 99.2% | 1.5× | Production servers |
| 256 | 33% | 97.5% | 3× | Balanced performance |
| 128 | 17% | 94.8% | 6× | Mobile, real-time |
| 64 | 8% | 89.3% | 12× | Edge devices |
| 32 | 4% | 82.1% | 24× | IoT, extreme constraints |
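Picking a row can be automated. The sketch below copies the accuracy column from the table and assumes float32 storage with no index overhead; `index_bytes` and `pick_dimension` are hypothetical helper names for this example.

```python
# (dimension, relative accuracy in %) copied from the table above
TABLE = [(768, 100.0), (512, 99.2), (256, 97.5), (128, 94.8), (64, 89.3), (32, 82.1)]

def index_bytes(n_vectors, dim, bytes_per_value=4):
    """Raw storage for n float32 vectors, ignoring any index overhead."""
    return n_vectors * dim * bytes_per_value

def pick_dimension(n_vectors, memory_budget_bytes, min_accuracy):
    """Largest dimension that fits the budget and still meets the accuracy floor."""
    for dim, accuracy in TABLE:  # ordered from largest to smallest
        if accuracy >= min_accuracy and index_bytes(n_vectors, dim) <= memory_budget_bytes:
            return dim
    return None  # no row satisfies both constraints

# 10 million vectors, 8 GiB of RAM, at least 90% of full-dimension accuracy
print(pick_dimension(10_000_000, 8 * 2**30, 90.0))  # → 128
```

With the same corpus and budget but a 95% accuracy floor, the function returns `None`: 256 dimensions would need about 10.2 GB, and 128 dimensions falls just short of the floor.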
In practice
The dominant pattern is cascaded (funnel) retrieval: filter the whole corpus cheaply at a small dimension, then rerank the survivors at full dimension — all from the same stored vectors.
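```python
import numpy as np

def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)  # L2-normalize along the last axis

def cascaded_search(query, docs, encode, k=10, n_candidates=1000):
    q = encode(query)  # one full query vector; docs holds full vectors, stored once
    # Stage 1: cheap 32-dim scan over everything (prefixes re-normalized for cosine)
    cand = np.argsort(-(unit(docs[:, :32]) @ unit(q[:32])))[:n_candidates]
    # Stage 2: rerank the survivors at full dimension
    best = np.argsort(-(unit(docs[cand]) @ unit(q)))[:k]
    return cand[best]  # indices into docs, best match first
```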
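A rough cost estimate shows why the funnel pays off. The sketch below counts only multiply-adds for scoring and assumes the 32-dimension scan, 1000 candidates, and 768-dimension rerank used above; it ignores normalization, top-k selection, and memory bandwidth, so treat the ratio as an upper-level estimate rather than a measured speedup.

```python
def brute_force_cost(n_docs, full_dim=768):
    """Multiply-adds for scoring every document at full dimension."""
    return n_docs * full_dim

def cascade_cost(n_docs, coarse_dim=32, n_candidates=1000, full_dim=768):
    """Multiply-adds for a coarse scan plus a full-dimension rerank."""
    return n_docs * coarse_dim + min(n_candidates, n_docs) * full_dim

n_docs = 1_000_000
speedup = brute_force_cost(n_docs) / cascade_cost(n_docs)
print(f"{speedup:.1f}x fewer multiply-adds")  # prints "23.4x fewer multiply-adds"
```

The rerank term is fixed at 1000 × 768, so the larger the corpus, the closer the ratio gets to the 24× of a pure 32-dimension scan.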
Two more rules of thumb: weight smaller dimensions at least as heavily as larger ones during training (they have the hardest job), and store the full vector once — never re-embed to change dimension.
References
- Kusupati et al. "Matryoshka Representation Learning" (2022). The original MRL paper.
- Nussbaum et al. "Nomic Embed: Training a Reproducible Long Context Text Embedder". A widely used MRL-trained model.
- OpenAI `text-embedding-3`. Production embeddings with native dimension truncation.
