A billion 768-dimensional float32 vectors is about 3 TB — far past the RAM of any single machine. IVF-PQ is how you fit that index on one box and still search it in milliseconds, by stacking the two techniques you have already met: an inverted-file partition to avoid scanning everything, and product quantization to shrink every vector to a handful of bytes.
Interactive IVF-PQ Visualization
The explorer is real: it builds IVF cells and PQ codebooks over the same GloVe vectors and measures recall@10 live as you turn the two dials — nprobe (how many cells to scan) and m (how many PQ subvectors).
Two quantizers, stacked
IVF-PQ chains a coarse quantizer and a fine one:
- Coarse — IVF. k-means partitions the space into cells; each vector is filed under its nearest centroid. A query only visits the
nprobenearest cells. - Fine — PQ on the residual. Within a cell, what matters is the vector's offset from the centroid, the residual r = x - c. PQ encodes that residual into
mbyte-sized codes, so each vector costsm·log₂kbits instead of32d:
Quantizing the residual rather than the raw vector is the key trick — the centroid already captures the coarse location, so PQ only has to describe the fine detail, and does it far more accurately.
Search: route, then ADC
At query time you score only the candidates in the probed cells, and you never decompress them — asymmetric distance computation sums precomputed query-to-centroid distances per subspace:
import faiss quantizer = faiss.IndexFlatL2(d) # the IVF coarse centroids index = faiss.IndexIVFPQ(quantizer, d, n_cells, m, 8) # m subvectors, 8 bits each index.train(X) # learn cells + PQ codebooks on residuals index.add(X) # store each vector as (cell id, m-byte code) index.nprobe = 16 # cells to scan per query — the speed/recall dial D, I = index.search(queries, k=10)
The two dials
nprobe buys recall with scan time; m buys recall with memory. Measured live on the GloVe vectors above (10 cells, PQ k=16):
| m | bytes/vec | compression | recall @ nprobe 1 | @ 3 | @ 10 |
|---|---|---|---|---|---|
| 2 | 1.0 | 200× | 0.62 | 0.70 | 0.71 |
| 10 | 5.0 | 40× | 0.65 | 0.81 | 0.84 |
| 25 | 12.5 | 16× | 0.66 | 0.87 | 0.92 |
The recall surface in the explorer makes the joint choice visible. A typical production point — nprobe ≈ 3, m ≈ 10 — holds ~0.8 recall while scanning a third of the corpus at 40× compression. Note that even an exhaustive nprobe never reaches 1.0: PQ approximates, which is why a final re-rank on full-precision vectors is standard.
IVF-PQ vs a flat index
A graph index reaches higher recall in memory, but IVF-PQ wins the metric that matters at scale: vectors per gigabyte. When the corpus no longer fits in RAM, IVF-PQ — or its compressed cousins like ScaNN — is the default, which the ANN comparison lays out across methods.
References
- Jégou et al. "Product Quantization for Nearest Neighbor Search" (IVF-ADC)
- Johnson et al. "Billion-scale similarity search with GPUs" (FAISS)
- Guo et al. "Accelerating Large-Scale Inference with Anisotropic Vector Quantization" (ScaNN)
Related concepts
How HNSW, IVF-PQ, and LSH compare for approximate nearest neighbor (ANN) search — recall, latency, memory, build cost, and update characteristics — with Annoy, ScaNN, and DiskANN included for completeness.
How HNSW navigates a layered proximity graph to find nearest neighbors in logarithmic time — the default in-memory index of modern vector databases.
Explore how LSH uses probabilistic hash functions to find similar vectors in sub-linear time, perfect for streaming and high-dimensional data.
Embedding quantization simulator: explore memory-accuracy trade-offs from float32 to int8 and binary representations for retrieval.
Master vector compression techniques from scalar to product quantization. Learn how to reduce memory usage by 10-100× while preserving search quality.
Learn how binary embeddings use 1-bit quantization for ultra-compact vector representations, enabling billion-scale similarity search with 32x memory reduction.
