What Calinski-Harabasz Measures
The CH index (also called the Variance Ratio Criterion) answers a simple question: are your clusters compact and well-separated? It computes the ratio of between-cluster scatter to within-cluster scatter, normalized by degrees of freedom. Higher values mean better-defined clusters. Proposed by Calinski and Harabasz in 1974, it remains one of the most widely used internal clustering evaluation metrics.
Think of it as a signal-to-noise ratio for clustering. The “signal” is how far apart the cluster centers are from the overall data center (between-cluster variance). The “noise” is how spread out the points are within each cluster (within-cluster variance). A good clustering maximizes the signal relative to the noise.
Mathematical Definition
The CH index is defined as:

$$\mathrm{CH} = \frac{SS_B / (k - 1)}{SS_W / (n - k)}$$

where k is the number of clusters and n is the total number of data points.
Between-cluster scatter SS_B measures how far cluster centroids are from the global centroid, weighting each cluster by its size:

$$SS_B = \sum_{i=1}^{k} n_i \,\lVert \mathbf{c}_i - \mathbf{c} \rVert^2$$

where n_i is the number of points in cluster i, c_i is its centroid, and c is the global centroid.
Within-cluster scatter SS_W measures how spread out points are within each cluster:

$$SS_W = \sum_{i=1}^{k} \sum_{\mathbf{x} \in C_i} \lVert \mathbf{x} - \mathbf{c}_i \rVert^2$$
The (k-1) and (n-k) terms normalize for the number of clusters and samples, making CH comparable across different values of k.
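The definition above translates directly into code. Here is a minimal NumPy sketch (the function name and structure are illustrative, not a library API):

```python
import numpy as np

def calinski_harabasz(X, labels):
    """Compute the CH index from the SS_B / SS_W decomposition.

    X: (n, d) array of points; labels: (n,) integer cluster labels.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = len(X)
    clusters = np.unique(labels)
    k = len(clusters)
    global_centroid = X.mean(axis=0)

    ss_b = 0.0  # between-cluster scatter, weighted by cluster size
    ss_w = 0.0  # within-cluster scatter
    for c in clusters:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        ss_b += len(members) * np.sum((centroid - global_centroid) ** 2)
        ss_w += np.sum((members - centroid) ** 2)

    # Degrees-of-freedom normalization: (k-1) for SS_B, (n-k) for SS_W
    return (ss_b / (k - 1)) / (ss_w / (n - k))

# Example: two tight, well-separated clusters
X = [[0, 0], [0, 1], [10, 0], [10, 1]]
print(calinski_harabasz(X, [0, 0, 1, 1]))  # 200.0
```

This should agree with `sklearn.metrics.calinski_harabasz_score` on the same inputs.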
Exploring CH Interactively
Switch between preset scenarios to see how cluster arrangement affects the CH index. Watch how SSB and SSW shift as clusters move closer, overlap, or take non-convex shapes.
CH Index Playground
Explore how cluster geometry affects the Calinski-Harabasz index.
How the Calculation Works
The computation follows a clear sequence. First, compute the global centroid from all data points. Then compute each cluster's centroid. SSB accumulates the squared distance from each cluster centroid to the global centroid, weighted by cluster size — larger clusters contribute more. SSW accumulates the squared distance from every point to its own cluster centroid. Finally, the ratio is normalized by degrees of freedom.
The degrees-of-freedom normalization is what makes CH fair across different values of k. Without it, increasing k would almost always increase SSB (more centroids spread further from the global center) and decrease SSW (smaller clusters are tighter). The (k-1) in the numerator and (n-k) in the denominator correct for this, penalizing unnecessary splits.
How CH Is Calculated
Step-by-step decomposition of the variance ratio on a simple 2-cluster example.
Selecting k with CH
The most common use of CH is selecting the optimal number of clusters. Run the clustering algorithm for k = 2, 3, 4, …, compute CH at each k, and pick the k that maximizes the score. The peak indicates where adding another cluster no longer improves the separation-to-spread ratio.
Dividing SS_B by (k-1) naturally penalizes over-splitting — adding a cluster that doesn't meaningfully reduce SS_W will decrease CH. This built-in regularization makes CH more robust than raw inertia for k-selection, where the “elbow” can be ambiguous.
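The k-selection loop is a few lines with scikit-learn. A sketch, assuming well-separated synthetic blobs (dataset parameters are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Synthetic data with 3 true clusters (illustrative setup)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)

best_k = max(scores, key=scores.get)  # k at the CH peak
print(best_k)
```

On data like this, the peak should land at the true cluster count.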
Selecting k with CH Index
Click each k value to explore. CH peaks at the optimal cluster count.
CH peaks at k=3, matching the true cluster count. Adding more clusters beyond this point doesn’t meaningfully reduce within-cluster variance (SS_W), so the ratio drops.
Strengths and Limitations
CH has several practical advantages. It runs in O(n · k) time — no pairwise distance matrix needed, unlike Silhouette Score's O(n²). The signal-to-noise interpretation is intuitive and easy to explain to stakeholders. It works well for convex, globular clusters produced by algorithms like k-means and GMM. The degrees-of-freedom normalization makes it fair across different k values, so you can directly compare scores without additional correction.
However, CH has meaningful limitations. It assumes convex cluster shapes — the centroid of a crescent-shaped cluster lies in empty space, so SSW is inflated and SSB is misleading. It is sensitive to cluster size imbalance — a few small outlier clusters far from center can inflate SSB disproportionately. It cannot evaluate a single cluster (k = 1 is undefined because of the (k-1) denominator). And it does not provide per-point diagnostics — you get one number for the entire clustering, with no way to identify which individual points are poorly assigned.
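The non-convexity failure is easy to reproduce. In this sketch on scikit-learn's two-moons data (dataset parameters are illustrative), CH scores a wrong-but-convex k-means split above the true crescent labels:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import calinski_harabasz_score

# Two interleaved crescents: the true clusters are non-convex.
X, true_labels = make_moons(n_samples=400, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

ch_true = calinski_harabasz_score(X, true_labels)      # correct but non-convex labeling
ch_kmeans = calinski_harabasz_score(X, kmeans_labels)  # wrong but convex split

# The crescents' centroids lie in empty space, inflating SS_W for the
# true labels, so CH prefers the incorrect convex partition.
print(ch_kmeans > ch_true)
```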
Comparing Clustering Metrics
CH is one of several internal evaluation metrics. Each makes different geometric assumptions.
| Property | Calinski-Harabasz | Silhouette Score | Davies-Bouldin |
|---|---|---|---|
| Formula | (SS_B/(k−1)) / (SS_W/(n−k)) | (b − a) / max(a, b) | avg_i max_{j≠i} (σ_i + σ_j)/d_ij |
| Better When | Higher | Higher | Lower |
| Range | [0, ∞) | [-1, 1] | [0, ∞) |
| Complexity | O(n·k) | O(n²) | O(n·k) |
| Convexity Bias | Assumes convex | Shape-agnostic | Assumes convex |
| Best For | Fast k-selection with k-means | Diagnosing individual point assignments | Worst-case cluster overlap detection |
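All three metrics ship in `sklearn.metrics`, so comparing them on the same clustering is straightforward (the blob dataset here is an illustrative setup):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score,
                             davies_bouldin_score,
                             silhouette_score)

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

ch = calinski_harabasz_score(X, labels)   # higher is better, [0, inf)
sil = silhouette_score(X, labels)         # higher is better, [-1, 1]
db = davies_bouldin_score(X, labels)      # lower is better, [0, inf)

print(f"CH:         {ch:.1f}")
print(f"Silhouette: {sil:.3f}")
print(f"DB:         {db:.3f}")
```

Note the opposite polarity of Davies-Bouldin: when combining metrics in a selection script, normalize directions first.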
Use CH when...
- You need fast computation (O(n·k))
- Working with convex, globular clusters (k-means, GMM)
- Comparing different k values on the same dataset
Consider alternatives when...
- Clusters are non-convex (crescents, rings) — use Silhouette
- You need per-point diagnostics — use Silhouette
- You want worst-case analysis — use Davies-Bouldin
Key Takeaways
- CH = between-cluster variance / within-cluster variance — higher means better-separated, more compact clusters. It is a signal-to-noise ratio for clustering quality.
- Use CH for fast k-selection — compute CH at each k, pick the peak. O(n · k) complexity makes it practical for large datasets where Silhouette's O(n²) is prohibitive.
- Beware of non-convex clusters — CH uses centroids, which misrepresent the geometry of crescents, rings, or other non-convex shapes. Use Silhouette Score for arbitrary geometries.
- Degrees of freedom matter — the (k-1) and (n-k) normalization prevents trivial score inflation from adding empty clusters, making CH a fair metric for comparing different k values.
Related Concepts
- Silhouette Score — Per-point evaluation metric that works with arbitrary cluster shapes
- Davies-Bouldin Index — Worst-case cluster similarity analysis with O(n·k) complexity
