IVF-PQ: Inverted File with Product Quantization

Summary: Learn how IVF-PQ combines clustering and compression to enable billion-scale vector search with minimal memory footprint.

A billion 768-dimensional float32 vectors is about 3 TB — far past the RAM of any single machine. IVF-PQ is how you fit that index on one box and still search it in milliseconds, by stacking the two techniques you have already met: an inverted-file partition to avoid scanning everything, and product quantization to shrink every vector to a handful of bytes.

Interactive IVF-PQ Visualization

The explorer is real: it builds IVF cells and PQ codebooks over the same GloVe vectors and measures recall@10 live as you turn the two dials — nprobe (how many cells to scan) and m (how many PQ subvectors).

Two quantizers, stacked

IVF-PQ chains a coarse quantizer and a fine one:

Coarse — IVF. k-means partitions the space into cells; each vector is filed under its nearest centroid. A query only visits the nprobe nearest cells.
Fine — PQ on the residual. Within a cell, what matters is the vector's offset from the centroid, the residual r = x - c. PQ encodes that residual into m byte-sized codes, so each vector costs m·log₂k bits instead of 32d:

x \;→\; \big(\,\text{cell}(x),\; \text{PQ}(x - c_\text{cell})\,\big) = \big(\,\text{id},\; [q₁, \dots, q_m]\,\big)

Quantizing the residual rather than the raw vector is the key trick — the centroid already captures the coarse location, so PQ only has to describe the fine detail, and does it far more accurately.

Search: route, then ADC

At query time you score only the candidates in the probed cells, and you never decompress them — asymmetric distance computation sums precomputed query-to-centroid distances per subspace:

d(q, x)² \;≈\; Σ_j=1^m \big\lVert\, q^(j) - c^(j)_{q_j} \,\big\rVert²

import faiss
quantizer = faiss.IndexFlatL2(d)                       # the IVF coarse centroids
index = faiss.IndexIVFPQ(quantizer, d, n_cells, m, 8)  # m subvectors, 8 bits each
index.train(X)        # learn cells + PQ codebooks on residuals
index.add(X)          # store each vector as (cell id, m-byte code)
index.nprobe = 16     # cells to scan per query — the speed/recall dial
D, I = index.search(queries, k=10)

The two dials

nprobe buys recall with scan time; m buys recall with memory. Measured live on the GloVe vectors above (10 cells, PQ k=16):

m	bytes/vec	compression	recall @ nprobe 1	@ 3	@ 10
2	1.0	200×	0.62	0.70	0.71
10	5.0	40×	0.65	0.81	0.84
25	12.5	16×	0.66	0.87	0.92

The recall surface in the explorer makes the joint choice visible. A typical production point — nprobe ≈ 3, m ≈ 10 — holds ~0.8 recall while scanning a third of the corpus at 40× compression. Note that even an exhaustive nprobe never reaches 1.0: PQ approximates, which is why a final re-rank on full-precision vectors is standard.

IVF-PQ vs a flat index

Aspect

Flat (float32)

IVF-PQ

Memory / vector

4d bytes (3 TB at 1B×768)

m·log₂k bits (≈ GBs)

Query cost

O(N·d) — scan all

nprobe cells, ADC lookups

Recall

Exact (1.0)

~0.8–0.95, re-rank to lift

Scale

Millions

Billions on one machine

Tuning

None

nprobe (speed) + m (memory)

A graph index reaches higher recall in memory, but IVF-PQ wins the metric that matters at scale: vectors per gigabyte. When the corpus no longer fits in RAM, IVF-PQ — or its compressed cousins like ScaNN — is the default, which the ANN comparison lays out across methods.

References

Jégou et al. "Product Quantization for Nearest Neighbor Search" (IVF-ADC)
Johnson et al. "Billion-scale similarity search with GPUs" (FAISS)
Guo et al. "Accelerating Large-Scale Inference with Anisotropic Vector Quantization" (ScaNN)

Embeddings & Retrieval

HNSW vs IVF-PQ vs LSH: Approximate Nearest Neighbor Algorithms Compared

How HNSW, IVF-PQ, and LSH compare for approximate nearest neighbor (ANN) search — recall, latency, memory, build cost, and update characteristics — with Annoy, ScaNN, and DiskANN included for completeness.

Embeddings & Retrieval

HNSW: Hierarchical Navigable Small World

How HNSW navigates a layered proximity graph to find nearest neighbors in logarithmic time — the default in-memory index of modern vector databases.

Embeddings & Retrieval

LSH: Locality Sensitive Hashing

Explore how LSH uses probabilistic hash functions to find similar vectors in sub-linear time, perfect for streaming and high-dimensional data.

Embeddings & Retrieval

Quantization Effects Simulator

Embedding quantization simulator: explore memory-accuracy trade-offs from float32 to int8 and binary representations for retrieval.

Embeddings & Retrieval

Vector Quantization Techniques

Master vector compression techniques from scalar to product quantization. Learn how to reduce memory usage by 10-100× while preserving search quality.

Embeddings & Retrieval

Binary Embeddings for Fast Search

Learn how binary embeddings use 1-bit quantization for ultra-compact vector representations, enabling billion-scale similarity search with 32x memory reduction.

IVF-PQ: Inverted File with Product Quantization

Interactive IVF-PQ Visualization

Two quantizers, stacked

Search: route, then ADC

The two dials

IVF-PQ vs a flat index

References

Related concepts