Skip to main content

Vector Quantization Techniques

Summary
Master vector compression techniques from scalar to product quantization. Learn how to reduce memory usage by 10-100× while preserving search quality.

Scalar quantization shrinks each dimension independently and tops out around 32×. Product quantization (PQ) goes much further — 10–100× — by quantizing whole subspaces against learned codebooks, and it is the technique underneath billion-scale vector search engines like FAISS.

Interactive Quantization Explorer

The explorer is real: it learns k-means codebooks on genuine GloVe vectors, encodes each vector to a handful of centroid indices, reconstructs from those centroids, and measures recall@10 live. Change the subvector count to walk the compression–recall curve.

From scalar to product quantization

Scalar quantization asks each dimension "which of these levels are you closest to?" Product quantization asks a smarter question of groups of dimensions: split the vector into m subvectors and, for each, learn a codebook of k representative centroids with k-means. Encoding replaces each subvector with the index of its nearest centroid:

x ∈ ℝd \;→\; [\,b1, b2, \dots, bm\,] ∈ \{0, \dots, k-1\}m

So a d-float vector becomes m small integers — m·log₂k bits instead of 32d:

\text{compression} = 32\,dm \,log2 k

With the usual k = 256 (one byte per subvector) that is exactly 4d/m — a 768-d vector in 96 bytes at m = 8.

def train_pq(X, m, k=256): d = X.shape[1]; sub = d // m # one k-means codebook per subspace return [KMeans(k).fit(X[:, s*sub:(s+1)*sub]).cluster_centers_ for s in range(m)] def encode(x, codebooks, sub): # nearest centroid index in each subspace → m-byte code return [np.argmin(((x[s*sub:(s+1)*sub] - C) ** 2).sum(1)) for s, C in enumerate(codebooks)]

Asymmetric distance computation

The clever part is search. You never decompress the database. Instead, for a query q you precompute a small table of distances from each query subvector to all k centroids in that subspace, then a document's distance is just m table lookups summed:

d(q, x)2 \;≈\; Σs=1m \big\lVert\, qs - Cs[\,bs\,] \,\big\rVert2

This is asymmetric — the query stays full-precision while the database is quantized — which keeps far more accuracy than comparing two codes.

def adc_search(q, codes, codebooks, sub): # table[s][j] = ||q_s - centroid_j||^2 for each subspace table = [((q[s*sub:(s+1)*sub] - C) ** 2).sum(1) for s, C in enumerate(codebooks)] # each doc's distance = sum of m lookups return [sum(table[s][b] for s, b in enumerate(code)) for code in codes]

The subvector tradeoff

More subvectors means finer codes — higher recall, but a longer code. Measured live on the GloVe vectors above (k = 32 here; production uses k = 256):

Subvectors mSub-dimCompressionrecall@10
225160×≈ 0.78
51064×≈ 0.84
10532×≈ 0.88
25212.8×≈ 0.94

The striking number is the top row: even at 160× compression PQ keeps ~78% of the neighbors — where scalar binary at a mere 32× already fell to ~61%. Quantizing subspaces jointly is simply far more efficient than quantizing dimensions alone.

Scalar vs product quantization

Scalar quantization
Product quantization
Quantizes
Each dimension
Each subvector (group of dims)
Codebook
Implicit (uniform levels)
Learned k-means per subspace
Typical compression
4–32×
10–100×
Recall at high compression
Drops sharply
Holds well
Search
Dequantize, then dot product
ADC via lookup tables

The difference is geometric. Give both schemes the same bit budget — the same number of reconstruction points — and PQ's learned centroids beat the uniform scalar grid on real vectors, because k-means places those points where the data actually is:

Beyond plain PQ

  • OPQ (Optimized PQ) learns a rotation that aligns the subspaces with the codebooks before quantizing, recovering a few points of recall for free.
  • Residual / additive quantization quantizes the leftover error with further codebooks, pushing accuracy higher at a fixed code length.
  • IVF + PQ pairs PQ codes with a coarse partition so search only scores a fraction of the index — the configuration most FAISS deployments actually run, covered in IVF-PQ.

Best practices

  1. Make m divide d. Subvectors must tile the dimension exactly; pad if needed.
  2. Train on representative data. Codebooks are only as good as the distribution they were fit on — re-train when the corpus shifts.
  3. Use OPQ when you can afford it. The rotation is a one-time cost for a consistent recall gain.
  4. Re-rank the top candidates. Retrieve generously with PQ, then re-score the few finalists with full-precision vectors.

References

  • Jégou et al. "Product Quantization for Nearest Neighbor Search"
  • Ge et al. "Optimized Product Quantization"
  • Babenko & Lempitsky "Additive Quantization for Extreme Vector Compression"

If you found this explanation helpful, consider sharing it with others.

Mastodon