What Silhouette Score Measures
The silhouette score answers a per-point question: is this data point closer to its own cluster or to the nearest neighboring cluster? Unlike the Calinski-Harabasz index, which produces one number for the entire clustering, the silhouette score computes a value for every individual point, revealing exactly where cluster assignments are strong and where they break down. Proposed by Peter Rousseeuw in 1987, it remains one of the most widely used internal clustering evaluation metrics precisely because of this granularity.
Each point's score ranges from −1 to +1. A score near +1 means the point is deep inside its cluster, far from any neighbor. Near 0 means it sits on the boundary between two clusters. Below 0 means the point is likely assigned to the wrong cluster — it is actually closer to a different cluster on average. This per-point granularity is the silhouette's defining advantage.
Mathematical Definition
For a point xᵢ in cluster Cₖ, first compute:

a(i) = (1 / (|Cₖ| − 1)) Σ d(xᵢ, xⱼ), summed over all other points xⱼ in Cₖ

This is the average distance to all other points in the same cluster — the intra-cluster distance.

b(i) = min over Cₗ ≠ Cₖ of (1 / |Cₗ|) Σ d(xᵢ, xⱼ), summed over the points xⱼ in Cₗ

This is the average distance to points in the nearest other cluster — the nearest-cluster distance.

The silhouette score for point xᵢ is:

s(i) = (b(i) − a(i)) / max(a(i), b(i))

The overall silhouette score is the mean across all points: S = (1/n) Σᵢ s(i).
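These definitions can be computed directly from a pairwise distance matrix. The sketch below is a brute-force illustration (assuming numpy and Euclidean distance; the function name `silhouette_values` is ours, not a library API):

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-point silhouette scores s(i), computed straight from the definition.

    X: (n, d) array of points; labels: length-n array of cluster ids.
    Illustrative sketch, not an optimized implementation.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Full pairwise Euclidean distance matrix — the O(n^2) step.
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = (labels == labels[i])
        same[i] = False
        if not same.any():        # singleton cluster: s(i) conventionally 0
            continue
        a = D[i, same].mean()     # mean distance to own cluster
        # b: smallest mean distance to any other cluster
        b = min(D[i, labels == c].mean() for c in set(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores
```

On two well-separated blobs every score lands near +1, while deliberately mislabeling a point drives its score negative — exactly the behavior the definition predicts.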
Exploring Silhouette Interactively
Each point is colored by its silhouette score — teal for confident assignments, red for potential misclassifications. Switch between presets to see how cluster geometry affects per-point scores.
Anatomy of a Single Point's Score
The power of the silhouette lies in its per-point decomposition. For any point, you can trace exactly why it scores well or poorly by examining its two distances. A core point deep inside a tight cluster will have small a (nearby same-cluster neighbors) and large b (distant other-cluster points), yielding s close to 1. A boundary point will have similar a and b, yielding s near 0. A misclassified point will have a > b — it is closer to the wrong cluster.
This decomposition makes silhouette uniquely useful for debugging. When a clustering produces a mediocre average score, you can inspect the worst-scoring points to understand whether the problem is boundary ambiguity, cluster overlap, or outright misassignment. No other standard internal metric provides this level of diagnostic detail.
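The decomposition is easy to inspect in code. A minimal sketch (assuming numpy and Euclidean distance; `explain_point` is a hypothetical helper, and every cluster is assumed to have at least two points):

```python
import numpy as np

def explain_point(X, labels, i):
    """Decompose point i's silhouette into its two distances.

    Returns (a, b, s) so you can see *why* the point scores as it does:
    a = mean distance to its own cluster, b = mean distance to the
    nearest other cluster, s = (b - a) / max(a, b).
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    d = np.linalg.norm(X - X[i], axis=1)   # distances from point i to all points
    own = (labels == labels[i])
    own[i] = False
    a = d[own].mean()
    b = min(d[labels == c].mean() for c in set(labels) if c != labels[i])
    return a, b, (b - a) / max(a, b)
```

A point sitting between two 1-D clusters gets a and b of similar size and a score near 0; flip its label and a exceeds b, pushing the score negative.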
Click a Point to Explore
Click any point to see its silhouette score calculated step by step — blue lines show intra-cluster distances (a), red lines show nearest-cluster distances (b).
Silhouette Plots and k-Selection
The silhouette plot is the canonical visualization for this metric. Points are sorted by score within each cluster, producing “knife shapes.” A good clustering shows wide, uniform knives — all clusters have consistently high scores. A poor clustering shows thin, jagged knives with negative tails. The average silhouette across k values identifies the optimal cluster count, but the plot's shape matters as much as the number — uniform widths across clusters indicate balanced, well-separated groups.
When using silhouette for k-selection, look beyond the average. A clustering with k = 3 and average silhouette 0.55 where all clusters score uniformly is often preferable to k = 2 with average 0.60 where one cluster scores 0.85 and the other scores 0.35. The plot reveals this imbalance immediately, while the average alone would mislead you into choosing fewer clusters.
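Averaging per-point scores gives the single number used for k-selection. A brute-force sketch (numpy, Euclidean distance; in practice you would score the label vector produced by your clustering algorithm at each candidate k — the `kmeans_labels` mapping in the comment is hypothetical):

```python
import numpy as np

def mean_silhouette(X, labels):
    """Average silhouette for one labeling, via the full O(n^2) distance matrix."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    out = []
    for i in range(len(X)):
        own = (labels == labels[i])
        own[i] = False
        a = D[i, own].mean()
        b = min(D[i, labels == c].mean() for c in set(labels) if c != labels[i])
        out.append((b - a) / max(a, b))
    return float(np.mean(out))

# Typical k-selection loop (labels from your clusterer, e.g. k-means):
# scores = {k: mean_silhouette(X, kmeans_labels[k]) for k in (2, 3, 4)}
```

On three evenly spaced 1-D clusters, the correct three-cluster labeling scores higher than a forced two-cluster merge — but as the text above notes, check the per-cluster plot as well as this average.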
Silhouette Analysis Across k
The silhouette plot (left) shows per-point scores grouped by cluster. The line chart (right) summarizes the average. Clean “knife shapes” at k=3 confirm optimal clustering.
k = 3: Clean, uniform knife shapes across all three clusters — each cluster has consistently high silhouette scores with no negative values. This is the signature of well-separated, correctly-identified clusters.
Strengths and Limitations
The silhouette score has several meaningful advantages. It works with arbitrary cluster shapes because it relies on pairwise distances rather than centroids — no assumption about convexity or globular geometry. Its per-point scores enable diagnosis of specific misassignments, something no other standard internal metric provides. The bounded [-1, +1] range is immediately interpretable without context. And it works with any distance metric — Euclidean, cosine, Manhattan, or custom domain-specific distances.
However, the silhouette has real limitations. Its O(n²) computation from pairwise distances makes it impractical for datasets above roughly 50,000 points without approximation or sampling. The average silhouette can mask problems — one excellent cluster and one terrible cluster might average to a “decent” score. It does not account for density differences between clusters, so a sparse cluster far from a dense cluster may score well despite poor internal cohesion. And for very high-dimensional data, distance concentration effects can compress the score range, reducing its discriminative power.
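One common workaround for the O(n²) cost is to score only a random subsample. A hedged sketch (numpy, Euclidean distance; `sampled_mean_silhouette` is our name for it, not a library function): each sampled point is still compared against the full dataset, so its a and b are exact — only the average is approximated.

```python
import numpy as np

def sampled_mean_silhouette(X, labels, sample_size=1000, seed=0):
    """Estimate the mean silhouette from a random subsample of points.

    O(sample_size * n) instead of O(n^2): we never build the full
    pairwise distance matrix, just one distance row per sampled point.
    """
    rng = np.random.default_rng(seed)
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
    out = []
    for i in idx:
        d = np.linalg.norm(X - X[i], axis=1)   # O(n) per sampled point
        own = (labels == labels[i])
        own[i] = False
        a = d[own].mean()
        b = min(d[labels == c].mean() for c in set(labels) if c != labels[i])
        out.append((b - a) / max(a, b))
    return float(np.mean(out))
```

For comparison, scikit-learn's `silhouette_score` exposes the same idea through its `sample_size` parameter.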
Comparing Clustering Metrics
Each internal clustering metric makes different geometric assumptions and trades off speed against diagnostic depth.
| Property | Calinski-Harabasz | Silhouette Score | Davies-Bouldin |
|---|---|---|---|
| Formula | SS_B / SS_W (normalized) | (b - a) / max(a, b) | avg max (σi+σj)/dij |
| Better When | Higher | Higher | Lower |
| Range | [0, ∞) | [-1, 1] | [0, ∞) |
| Complexity | O(n·k) | O(n²) | O(n·k) |
| Convexity Bias | Assumes convex | Shape-agnostic | Assumes convex |
| Best For | Fast k-selection with k-means | Diagnosing individual point assignments | Worst-case cluster overlap detection |
Use Silhouette when...
- You need per-point diagnostics (which points are misassigned?)
- Clusters may be non-convex (crescents, rings, arbitrary shapes)
- You need a bounded, interpretable score (−1 to +1)
Consider alternatives when...
- Dataset is large (O(n²) becomes prohibitive) — use CH instead
- You only need to compare k values, not diagnose points — use CH
- You want cluster-level worst-case analysis — use Davies-Bouldin
Key Takeaways
- Silhouette measures per-point fit — s(i) = (b - a) / max(a, b), where a is intra-cluster distance and b is nearest-cluster distance. Scores range from −1 (wrong cluster) to +1 (perfect fit).
- Silhouette plots reveal cluster quality visually — wide, uniform “knife shapes” indicate well-separated clusters. Negative tails signal misassigned boundary points.
- Shape-agnostic but O(n²) — unlike centroid-based metrics (CH, DB), silhouette works for arbitrary geometries. But pairwise distance computation limits scalability to moderate dataset sizes.
- Per-point diagnostics are the key advantage — no other standard clustering metric tells you which specific points are problematic. Use silhouette when you need to understand why a clustering fails, not just whether it does.
Related Concepts
- Calinski-Harabasz Index — Fast variance-ratio metric for convex clusters
- Davies-Bouldin Index — Worst-case cluster similarity analysis
