Vector Quantization Techniques

Scalar quantization shrinks each dimension independently and tops out around 32×. Product quantization (PQ) goes much further — 10–100× — by quantizing whole subspaces against learned codebooks, and it is the technique underneath billion-scale vector search engines like FAISS.

Interactive Quantization Explorer

The explorer is real: it learns k-means codebooks on genuine GloVe vectors, encodes each vector to a handful of centroid indices, reconstructs from those centroids, and measures recall@10 live. Change the subvector count to walk the compression–recall curve.

From scalar to product quantization

Scalar quantization asks each dimension "which of these levels are you closest to?" Product quantization asks a smarter question of groups of dimensions: split the vector into m subvectors and, for each, learn a codebook of k representative centroids with k-means. Encoding replaces each subvector with the index of its nearest centroid:

x ∈ ℝ^d \;→\; [\,b₁, b₂, \dots, b_m\,] ∈ \{0, \dots, k-1\}^m

So a d-float vector becomes m small integers — m·log₂k bits instead of 32d:

\text{compression} = 32\,dm \,log₂ k

With the usual k = 256 (one byte per subvector) that is exactly 4d/m — a 768-d vector in 96 bytes at m = 8.

def train_pq(X, m, k=256):
    d = X.shape[1]; sub = d // m
    # one k-means codebook per subspace
    return [KMeans(k).fit(X[:, s*sub:(s+1)*sub]).cluster_centers_
            for s in range(m)]

def encode(x, codebooks, sub):
    # nearest centroid index in each subspace → m-byte code
    return [np.argmin(((x[s*sub:(s+1)*sub] - C) ** 2).sum(1))
            for s, C in enumerate(codebooks)]

Asymmetric distance computation

The clever part is search. You never decompress the database. Instead, for a query q you precompute a small table of distances from each query subvector to all k centroids in that subspace, then a document's distance is just m table lookups summed:

d(q, x)² \;≈\; Σ_s=1^m \big\lVert\, q_s - C_s[\,b_s\,] \,\big\rVert²

This is asymmetric — the query stays full-precision while the database is quantized — which keeps far more accuracy than comparing two codes.

def adc_search(q, codes, codebooks, sub):
    # table[s][j] = ||q_s - centroid_j||^2  for each subspace
    table = [((q[s*sub:(s+1)*sub] - C) ** 2).sum(1)
             for s, C in enumerate(codebooks)]
    # each doc's distance = sum of m lookups
    return [sum(table[s][b] for s, b in enumerate(code)) for code in codes]

The subvector tradeoff

More subvectors means finer codes — higher recall, but a longer code. Measured live on the GloVe vectors above (k = 32 here; production uses k = 256):

Subvectors m	Sub-dim	Compression	recall@10
2	25	160×	≈ 0.78
5	10	64×	≈ 0.84
10	5	32×	≈ 0.88
25	2	12.8×	≈ 0.94

The striking number is the top row: even at 160× compression PQ keeps ~78% of the neighbors — where scalar binary at a mere 32× already fell to ~61%. Quantizing subspaces jointly is simply far more efficient than quantizing dimensions alone.

Scalar vs product quantization

Aspect

Scalar quantization

Product quantization

Quantizes

Each dimension

Each subvector (group of dims)

Codebook

Implicit (uniform levels)

Learned k-means per subspace

Typical compression

4–32×

10–100×

Recall at high compression

Drops sharply

Holds well

Dequantize, then dot product

ADC via lookup tables

The difference is geometric. Give both schemes the same bit budget — the same number of reconstruction points — and PQ's learned centroids beat the uniform scalar grid on real vectors, because k-means places those points where the data actually is:

Beyond plain PQ

OPQ (Optimized PQ) learns a rotation that aligns the subspaces with the codebooks before quantizing, recovering a few points of recall for free.
Residual / additive quantization quantizes the leftover error with further codebooks, pushing accuracy higher at a fixed code length.
IVF + PQ pairs PQ codes with a coarse partition so search only scores a fraction of the index — the configuration most FAISS deployments actually run, covered in IVF-PQ.

Best practices

Make m divide d. Subvectors must tile the dimension exactly; pad if needed.
Train on representative data. Codebooks are only as good as the distribution they were fit on — re-train when the corpus shifts.
Use OPQ when you can afford it. The rotation is a one-time cost for a consistent recall gain.
Re-rank the top candidates. Retrieve generously with PQ, then re-score the few finalists with full-precision vectors.

References

Jégou et al. "Product Quantization for Nearest Neighbor Search"
Ge et al. "Optimized Product Quantization"
Babenko & Lempitsky "Additive Quantization for Extreme Vector Compression"

Embeddings & Retrieval

Quantization Effects Simulator

Embedding quantization simulator: explore memory-accuracy trade-offs from float32 to int8 and binary representations for retrieval.

Embeddings & Retrieval

Binary Embeddings for Fast Search

Learn how binary embeddings use 1-bit quantization for ultra-compact vector representations, enabling billion-scale similarity search with 32x memory reduction.

Embeddings & Retrieval

IVF-PQ: Inverted File with Product Quantization

Learn how IVF-PQ combines clustering and compression to enable billion-scale vector search with minimal memory footprint.

Embeddings & Retrieval

HNSW vs IVF-PQ vs LSH: Approximate Nearest Neighbor Algorithms Compared

How HNSW, IVF-PQ, and LSH compare for approximate nearest neighbor (ANN) search — recall, latency, memory, build cost, and update characteristics — with Annoy, ScaNN, and DiskANN included for completeness.

Embeddings & Retrieval

Dense Embeddings

How dense embeddings turn meaning into geometry: word2vec, GloVe, and contextual models, vector arithmetic, cosine similarity, and where the field is heading.

Embeddings & Retrieval

HNSW: Hierarchical Navigable Small World

How HNSW navigates a layered proximity graph to find nearest neighbors in logarithmic time — the default in-memory index of modern vector databases.