Skip to main content

IVF-PQ: Inverted File with Product Quantization

Summary
Learn how IVF-PQ combines clustering and compression to enable billion-scale vector search with minimal memory footprint.

A billion 768-dimensional float32 vectors is about 3 TB — far past the RAM of any single machine. IVF-PQ is how you fit that index on one box and still search it in milliseconds, by stacking the two techniques you have already met: an inverted-file partition to avoid scanning everything, and product quantization to shrink every vector to a handful of bytes.

Interactive IVF-PQ Visualization

The explorer is real: it builds IVF cells and PQ codebooks over the same GloVe vectors and measures recall@10 live as you turn the two dials — nprobe (how many cells to scan) and m (how many PQ subvectors).

Two quantizers, stacked

IVF-PQ chains a coarse quantizer and a fine one:

  1. Coarse — IVF. k-means partitions the space into cells; each vector is filed under its nearest centroid. A query only visits the nprobe nearest cells.
  2. Fine — PQ on the residual. Within a cell, what matters is the vector's offset from the centroid, the residual r = x - c. PQ encodes that residual into m byte-sized codes, so each vector costs m·log₂k bits instead of 32d:
x \;→\; \big(\,\text{cell}(x),\; \text{PQ}(x - c\text{cell})\,\big) = \big(\,\text{id},\; [q1, \dots, qm]\,\big)

Quantizing the residual rather than the raw vector is the key trick — the centroid already captures the coarse location, so PQ only has to describe the fine detail, and does it far more accurately.

Search: route, then ADC

At query time you score only the candidates in the probed cells, and you never decompress them — asymmetric distance computation sums precomputed query-to-centroid distances per subspace:

d(q, x)2 \;≈\; Σj=1m \big\lVert\, q(j) - c(j)qj \,\big\rVert2
import faiss quantizer = faiss.IndexFlatL2(d) # the IVF coarse centroids index = faiss.IndexIVFPQ(quantizer, d, n_cells, m, 8) # m subvectors, 8 bits each index.train(X) # learn cells + PQ codebooks on residuals index.add(X) # store each vector as (cell id, m-byte code) index.nprobe = 16 # cells to scan per query — the speed/recall dial D, I = index.search(queries, k=10)

The two dials

nprobe buys recall with scan time; m buys recall with memory. Measured live on the GloVe vectors above (10 cells, PQ k=16):

mbytes/veccompressionrecall @ nprobe 1@ 3@ 10
21.0200×0.620.700.71
105.040×0.650.810.84
2512.516×0.660.870.92

The recall surface in the explorer makes the joint choice visible. A typical production point — nprobe ≈ 3, m ≈ 10 — holds ~0.8 recall while scanning a third of the corpus at 40× compression. Note that even an exhaustive nprobe never reaches 1.0: PQ approximates, which is why a final re-rank on full-precision vectors is standard.

IVF-PQ vs a flat index

Flat (float32)
IVF-PQ
Memory / vector
4d bytes (3 TB at 1B×768)
m·log₂k bits (≈ GBs)
Query cost
O(N·d) — scan all
nprobe cells, ADC lookups
Recall
Exact (1.0)
~0.8–0.95, re-rank to lift
Scale
Millions
Billions on one machine
Tuning
None
nprobe (speed) + m (memory)

A graph index reaches higher recall in memory, but IVF-PQ wins the metric that matters at scale: vectors per gigabyte. When the corpus no longer fits in RAM, IVF-PQ — or its compressed cousins like ScaNN — is the default, which the ANN comparison lays out across methods.

References

  • Jégou et al. "Product Quantization for Nearest Neighbor Search" (IVF-ADC)
  • Johnson et al. "Billion-scale similarity search with GPUs" (FAISS)
  • Guo et al. "Accelerating Large-Scale Inference with Anisotropic Vector Quantization" (ScaNN)

If you found this explanation helpful, consider sharing it with others.

Mastodon