Scalar quantization shrinks each dimension independently and tops out around 32×. Product quantization (PQ) goes much further — 10–100× — by quantizing whole subspaces against learned codebooks, and it is the technique underneath billion-scale vector search engines like FAISS.
Interactive Quantization Explorer
The explorer is real: it learns k-means codebooks on genuine GloVe vectors, encodes each vector to a handful of centroid indices, reconstructs from those centroids, and measures recall@10 live. Change the subvector count to walk the compression–recall curve.
From scalar to product quantization
Scalar quantization asks each dimension "which of these levels are you closest to?" Product quantization asks a smarter question of groups of dimensions: split the vector into m subvectors and, for each, learn a codebook of k representative centroids with k-means. Encoding replaces each subvector with the index of its nearest centroid:
So a d-float vector becomes m small integers — m·log₂k bits instead of 32d:
With the usual k = 256 (one byte per subvector) that is exactly 4d/m — a 768-d vector in 96 bytes at m = 8.
def train_pq(X, m, k=256): d = X.shape[1]; sub = d // m # one k-means codebook per subspace return [KMeans(k).fit(X[:, s*sub:(s+1)*sub]).cluster_centers_ for s in range(m)] def encode(x, codebooks, sub): # nearest centroid index in each subspace → m-byte code return [np.argmin(((x[s*sub:(s+1)*sub] - C) ** 2).sum(1)) for s, C in enumerate(codebooks)]
Asymmetric distance computation
The clever part is search. You never decompress the database. Instead, for a query q you precompute a small table of distances from each query subvector to all k centroids in that subspace, then a document's distance is just m table lookups summed:
This is asymmetric — the query stays full-precision while the database is quantized — which keeps far more accuracy than comparing two codes.
def adc_search(q, codes, codebooks, sub): # table[s][j] = ||q_s - centroid_j||^2 for each subspace table = [((q[s*sub:(s+1)*sub] - C) ** 2).sum(1) for s, C in enumerate(codebooks)] # each doc's distance = sum of m lookups return [sum(table[s][b] for s, b in enumerate(code)) for code in codes]
The subvector tradeoff
More subvectors means finer codes — higher recall, but a longer code. Measured live on the GloVe vectors above (k = 32 here; production uses k = 256):
| Subvectors m | Sub-dim | Compression | recall@10 |
|---|---|---|---|
| 2 | 25 | 160× | ≈ 0.78 |
| 5 | 10 | 64× | ≈ 0.84 |
| 10 | 5 | 32× | ≈ 0.88 |
| 25 | 2 | 12.8× | ≈ 0.94 |
The striking number is the top row: even at 160× compression PQ keeps ~78% of the neighbors — where scalar binary at a mere 32× already fell to ~61%. Quantizing subspaces jointly is simply far more efficient than quantizing dimensions alone.
Scalar vs product quantization
The difference is geometric. Give both schemes the same bit budget — the same number of reconstruction points — and PQ's learned centroids beat the uniform scalar grid on real vectors, because k-means places those points where the data actually is:
Beyond plain PQ
- OPQ (Optimized PQ) learns a rotation that aligns the subspaces with the codebooks before quantizing, recovering a few points of recall for free.
- Residual / additive quantization quantizes the leftover error with further codebooks, pushing accuracy higher at a fixed code length.
- IVF + PQ pairs PQ codes with a coarse partition so search only scores a fraction of the index — the configuration most FAISS deployments actually run, covered in IVF-PQ.
Best practices
- Make
mdivided. Subvectors must tile the dimension exactly; pad if needed. - Train on representative data. Codebooks are only as good as the distribution they were fit on — re-train when the corpus shifts.
- Use OPQ when you can afford it. The rotation is a one-time cost for a consistent recall gain.
- Re-rank the top candidates. Retrieve generously with PQ, then re-score the few finalists with full-precision vectors.
References
- Jégou et al. "Product Quantization for Nearest Neighbor Search"
- Ge et al. "Optimized Product Quantization"
- Babenko & Lempitsky "Additive Quantization for Extreme Vector Compression"
Related concepts
Embedding quantization simulator: explore memory-accuracy trade-offs from float32 to int8 and binary representations for retrieval.
Learn how binary embeddings use 1-bit quantization for ultra-compact vector representations, enabling billion-scale similarity search with 32x memory reduction.
Learn how IVF-PQ combines clustering and compression to enable billion-scale vector search with minimal memory footprint.
How HNSW, IVF-PQ, and LSH compare for approximate nearest neighbor (ANN) search — recall, latency, memory, build cost, and update characteristics — with Annoy, ScaNN, and DiskANN included for completeness.
How dense embeddings turn meaning into geometry: word2vec, GloVe, and contextual models, vector arithmetic, cosine similarity, and where the field is heading.
How HNSW navigates a layered proximity graph to find nearest neighbors in logarithmic time — the default in-memory index of modern vector databases.
