Calinski-Harabasz Index: The Variance Ratio Criterion

How the Calinski-Harabasz index evaluates clustering quality by measuring the ratio of between-cluster to within-cluster variance — fast, intuitive, and ideal for k-selection with convex clusters.


What Calinski-Harabasz Measures

The CH index (also called the Variance Ratio Criterion) answers a simple question: are your clusters compact and well-separated? It computes the ratio of between-cluster scatter to within-cluster scatter, normalized by degrees of freedom. Higher values mean better-defined clusters. Proposed by Calinski and Harabasz in 1974, it remains one of the most widely used internal clustering evaluation metrics.

Think of it as a signal-to-noise ratio for clustering. The “signal” is how far apart the cluster centers are from the overall data center (between-cluster variance). The “noise” is how spread out the points are within each cluster (within-cluster variance). A good clustering maximizes the signal relative to the noise.

Mathematical Definition

The CH index is defined as:

CH = (SSB / (k - 1)) / (SSW / (n - k))

where k is the number of clusters and n is the total number of data points.

Between-cluster scatter SSB measures how far cluster centroids are from the global centroid:

SSB = Σ_{j=1}^{k} n_j ‖μ_j - μ‖²

Within-cluster scatter SSW measures how spread out points are within each cluster:

SSW = Σ_{j=1}^{k} Σ_{x_i ∈ C_j} ‖x_i - μ_j‖²

The (k-1) and (n-k) terms normalize for the number of clusters and samples, making CH comparable across different values of k.
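These formulas can be computed directly with NumPy. The sketch below uses a hypothetical toy dataset (two tight clusters of three points each, so n = 6 and k = 2) and mirrors each formula term by term:

```python
import numpy as np

# Hypothetical toy data: two clusters of 3 points each (n = 6, k = 2)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [10.0, 10.0], [11.0, 10.0], [10.0, 11.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

n, k = len(X), 2
mu = X.mean(axis=0)  # global centroid

ssb = ssw = 0.0
for j in range(k):
    Cj = X[labels == j]
    mu_j = Cj.mean(axis=0)                         # cluster centroid
    ssb += len(Cj) * np.sum((mu_j - mu) ** 2)      # n_j * ||mu_j - mu||^2
    ssw += np.sum((Cj - mu_j) ** 2)                # sum of ||x_i - mu_j||^2

ch = (ssb / (k - 1)) / (ssw / (n - k))
print(ssb, ssw, ch)  # SSB = 300.0, SSW ≈ 2.667, CH = 450.0
```

The two clusters are far apart (large SSB) and very tight (small SSW), so the ratio is large, exactly the "signal-to-noise" behavior described above.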

Exploring CH Interactively

Switch between preset scenarios to see how cluster arrangement affects the CH index. Watch how SSB and SSW shift as clusters move closer, overlap, or take non-convex shapes.

CH Index Playground

Explore how cluster geometry affects the Calinski-Harabasz index.

In the default scenario (three tight, well-separated clusters), SS_B = 882.2 and SS_W = 34.7, giving a high CH index of 1485.6. SS_B dominates because the centroids are far from the global center, while SS_W is small because points cluster tightly.

How the Calculation Works

The computation follows a clear sequence. First, compute the global centroid from all data points. Then compute each cluster's centroid. SSB accumulates the squared distance from each cluster centroid to the global centroid, weighted by cluster size — larger clusters contribute more. SSW accumulates the squared distance from every point to its own cluster centroid. Finally, the ratio is normalized by degrees of freedom.

The degrees-of-freedom normalization is what makes CH fair across different values of k. Without it, increasing k would almost always increase SSB (more centroids spread further from the global center) and decrease SSW (smaller clusters are tighter). The (k-1) in the numerator and (n-k) in the denominator correct for this, penalizing unnecessary splits.
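The sequence above translates into a short function. This is a sketch assuming scikit-learn and NumPy are available; it cross-checks the hand-rolled version against scikit-learn's built-in `calinski_harabasz_score` on a synthetic blob dataset:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

def ch_index(X, labels):
    """Calinski-Harabasz index, following the sequence described above."""
    n, k = len(X), len(np.unique(labels))
    mu = X.mean(axis=0)                              # 1. global centroid
    ssb = ssw = 0.0
    for j in np.unique(labels):
        Cj = X[labels == j]
        mu_j = Cj.mean(axis=0)                       # 2. per-cluster centroid
        ssb += len(Cj) * np.sum((mu_j - mu) ** 2)    # 3. size-weighted between-scatter
        ssw += np.sum((Cj - mu_j) ** 2)              # 4. within-scatter
    return (ssb / (k - 1)) / (ssw / (n - k))         # 5. degrees-of-freedom ratio

X, y = make_blobs(n_samples=300, centers=3, random_state=0)
print(ch_index(X, y))
print(calinski_harabasz_score(X, y))  # should agree with the hand-rolled version
```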

How CH Is Calculated

Step-by-step decomposition of the variance ratio on a simple 2-cluster example.


Selecting k with CH

The most common use of CH is selecting the optimal number of clusters. Run the clustering algorithm for k = 2, 3, 4, …, compute CH at each k, and pick the k that maximizes the score. The peak indicates where adding another cluster no longer improves the separation-to-spread ratio.

Dividing SSB by (k - 1) naturally penalizes over-splitting — adding a cluster that doesn't meaningfully reduce SSW will decrease CH. This built-in regularization makes CH more robust than raw inertia for k-selection, where the “elbow” can be ambiguous.
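In code, the sweep is a short loop. A minimal sketch assuming scikit-learn, using a hypothetical synthetic dataset with four well-separated blobs:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Hypothetical data: four well-separated blobs (true k = 4)
centers = [[0, 0], [8, 8], [-8, 8], [8, -8]]
X, _ = make_blobs(n_samples=500, centers=centers,
                  cluster_std=0.8, random_state=42)

# Run k-means for k = 2..8 and record CH at each k
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # the peak should land at the true cluster count, 4
```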

Selecting k with CH Index

Click each k value to explore. CH peaks at the optimal cluster count.

In this example the score peaks at k = 3 with CH = 452.7, matching the true cluster count. Adding more clusters beyond this point doesn’t meaningfully reduce within-cluster variance (SS_W), so the ratio drops.

Strengths and Limitations

CH has several practical advantages. It runs in O(n · k) time — no pairwise distance matrix needed, unlike the Silhouette Score's O(n²). The signal-to-noise interpretation is intuitive and easy to explain to stakeholders. It works well for the convex, globular clusters produced by algorithms like k-means and GMM. And the degrees-of-freedom normalization makes it fair across different k values, so you can directly compare scores without additional correction.

However, CH has meaningful limitations. It assumes convex cluster shapes — the centroid of a crescent-shaped cluster lies in empty space, so SSW is inflated and SSB is misleading. It is sensitive to cluster size imbalance — a few small outlier clusters far from center can inflate SSB disproportionately. It cannot evaluate a single cluster (k = 1 is undefined because of the (k-1) denominator). And it does not provide per-point diagnostics — you get one number for the entire clustering, with no way to identify which individual points are poorly assigned.

Comparing Clustering Metrics

CH is one of several internal evaluation metrics. Each makes different geometric assumptions.

| Metric | Formula | Better when | Range | Complexity | Convexity bias | Best for |
|---|---|---|---|---|---|---|
| Calinski-Harabasz | SS_B / SS_W (normalized) | Higher | [0, ∞) | O(n·k) | Assumes convex | Fast k-selection with k-means |
| Silhouette Score | (b - a) / max(a, b) | Higher | [-1, 1] | O(n²) | Shape-agnostic | Diagnosing individual point assignments |
| Davies-Bouldin | avg_i max_j (σ_i + σ_j) / d_ij | Lower | [0, ∞) | O(n·k) | Assumes convex | Worst-case cluster overlap detection |
Use CH when...
  - You need fast computation (O(n·k))
  - Working with convex, globular clusters (k-means, GMM)
  - Comparing different k values on the same dataset

Consider alternatives when...
  - Clusters are non-convex (crescents, rings) — use Silhouette
  - You need per-point diagnostics — use Silhouette
  - You want worst-case analysis — use Davies-Bouldin
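All three metrics are one call each in scikit-learn, so it is cheap to compute them side by side. A sketch on a hypothetical non-convex "two moons" dataset, the kind of shape where CH's convexity assumption bites:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import (calinski_harabasz_score,
                             davies_bouldin_score, silhouette_score)

# Non-convex crescents: centroid-based metrics can be misleading here
X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print("CH (higher is better):       ", calinski_harabasz_score(X, labels))
print("Silhouette (higher is better):", silhouette_score(X, labels))
print("Davies-Bouldin (lower better):", davies_bouldin_score(X, labels))
```

Note the opposite orientations: CH and Silhouette reward higher scores, Davies-Bouldin rewards lower ones, so the three numbers are not directly comparable with each other.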

Key Takeaways

  1. CH = between-cluster variance / within-cluster variance — higher means better-separated, more compact clusters. It is a signal-to-noise ratio for clustering quality.
  2. Use CH for fast k-selection — compute CH at each k, pick the peak. O(n · k) complexity makes it practical for large datasets where Silhouette's O(n²) is prohibitive.
  3. Beware of non-convex clusters — CH uses centroids, which misrepresent the geometry of crescents, rings, or other non-convex shapes. Use Silhouette Score for arbitrary geometries.
  4. Degrees of freedom matter — the (k-1) and (n-k) normalization prevents trivial score inflation from adding empty clusters, making CH a fair metric for comparing different k values.

If you found this explanation helpful, consider sharing it with others.
