Standard clustering metrics like Silhouette Score or Davies-Bouldin validate geometry, not meaning. In unsupervised scenarios without ground truth, we rely on these proxies, assuming that spatial compactness equals semantic relevance.
Often, this assumption fails. You get a high Silhouette Score, but when stakeholders ask, “What makes Segment A unique?”, you can’t answer because feature distributions overlap. This is the Explainability Gap.
The logic: distribution over distance
In my practice, the robust solution is Jensen-Shannon Divergence (JSD). Unlike Euclidean distance, JSD measures the statistical divergence between probability distributions.
The logic is simple: For every cluster and every feature, we calculate the divergence between:
- The distribution of the feature inside the cluster ($P$)
- The distribution of the feature outside the cluster ($Q$)
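Formally, JSD is the symmetrized, smoothed variant of KL divergence: both distributions are compared against their 50/50 mixture $M$:

$$\mathrm{JSD}(P \,\|\, Q) = \tfrac{1}{2}\,\mathrm{KL}(P \,\|\, M) + \tfrac{1}{2}\,\mathrm{KL}(Q \,\|\, M), \qquad M = \tfrac{1}{2}(P + Q)$$

Because $M$ always covers the support of both $P$ and $Q$, neither KL term can diverge.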
Why not Kullback-Leibler (KL)? KL divergence is asymmetric ($KL(P||Q) \neq KL(Q||P)$) and explodes to infinity when the distributions don't overlap, i.e., when $Q$ assigns zero probability to a bin where $P$ does not (which happens constantly in clustering). JSD solves this: it is symmetric and strictly bounded to $[0, \ln(2)] \approx [0, 0.693]$ with the natural logarithm (or $[0, 1]$ with a base-2 logarithm). This bounded nature makes it numerically stable for hyperparameter tuning, whereas KL divergence often explodes and breaks optimizers.
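A minimal numeric sketch of this failure mode, using two hypothetical non-overlapping histograms (not from any real dataset): KL blows up, while JSD sits exactly at its upper bound.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

# Two non-overlapping distributions over four bins
p = np.array([0.5, 0.5, 0.0, 0.0])
q = np.array([0.0, 0.0, 0.5, 0.5])

kl = entropy(p, q)              # KL(P||Q): infinite, Q has zero mass where P doesn't
jsd = jensenshannon(p, q) ** 2  # squared JS distance = JS divergence

print(kl)   # inf
print(jsd)  # ~0.6931 = ln(2), the upper bound
```

Note that SciPy's `jensenshannon` returns the JS *distance* (the square root of the divergence), hence the squaring above.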
Conceptual implementation
A critical engineering detail here is the Size Penalty. Pure JSD is mathematically maximized by singleton clusters (a single point is perfectly unique). To prevent the metric from “gaming” the system and creating micro-clusters, we must explicitly penalize them.
The following implementation computes the weighted JSD score for a labeled dataframe:
```python
import numpy as np
from scipy.spatial.distance import jensenshannon


def calculate_jsd_metric(df, labels, feature_cols, penalty_weight=0.1):
    unique_labels = np.unique(labels)
    total_score = 0.0
    for label in unique_labels:
        in_cluster = df[labels == label]
        out_cluster = df[labels != label]

        # 1. Calculate JSD per feature
        c_score = 0.0
        for f in feature_cols:
            # Fixed binning for stability (production may use adaptive strategies);
            # the in-cluster bin edges are reused for Q so both PMFs share a support
            p, bins = np.histogram(in_cluster[f], bins=20)
            q, _ = np.histogram(out_cluster[f], bins=bins)
            p = p / p.sum()  # normalize to PMF
            q = q / q.sum()  # normalize to PMF
            # scipy returns the JS *distance* (square root of the divergence),
            # so square it to recover the bounded divergence described above
            c_score += jensenshannon(p, q) ** 2

        # Average over all features
        # Note: Production variant may select top-k features by JSD
        # for more focused interpretation instead of averaging all
        avg_jsd = c_score / len(feature_cols)

        # 2. Apply Size Penalty (Crucial!)
        # Simplified sigmoid penalty (production uses log-based ratio):
        # clusters much smaller than ~10 points are scaled toward zero,
        # while large clusters keep their full JSD score
        n = len(in_cluster)
        penalty = 1.0 / (1.0 + np.exp(-penalty_weight * (n - 10)))
        total_score += avg_jsd * penalty

    return total_score / len(unique_labels)
```
Implementation advice
I use this metric in two distinct ways:
- For Cluster Interpretability (Post-hoc): To explain why a cluster exists, I calculate JSD for all features. Instead of staring at centroids, I sort features by JSD; the top features define the cluster's identity (e.g., "Segment B is distinct because of low 'Recency' and high 'Frequency'").
- As an Optimization Target: When standard grid search yields geometric blobs with no business value, I add JSD to the objective function: $\text{Score} = (1 - \alpha) \cdot \text{Silhouette} + \alpha \cdot \text{JSD}$. A weight of $\alpha \approx 0.3$ usually forces the algorithm to sacrifice some compactness for better distinguishability.
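To illustrate the post-hoc use, here is a toy sketch on synthetic data (the feature names `recency` and `age` are hypothetical): features are ranked by per-feature JSD, and only the feature whose distribution actually shifts inside the segment should surface at the top.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(42)

# Synthetic segment: 'recency' is shifted inside the cluster, 'age' is not
in_cluster = {"recency": rng.normal(2.0, 1.0, 500), "age": rng.normal(40, 10, 500)}
out_cluster = {"recency": rng.normal(8.0, 1.0, 1500), "age": rng.normal(40, 10, 1500)}

def feature_jsd(x_in, x_out, bins=20):
    """Squared JS distance (= JS divergence) between in/out histograms."""
    lo, hi = min(x_in.min(), x_out.min()), max(x_in.max(), x_out.max())
    p, edges = np.histogram(x_in, bins=bins, range=(lo, hi))
    q, _ = np.histogram(x_out, bins=edges)  # same bin edges for both PMFs
    return jensenshannon(p / p.sum(), q / q.sum()) ** 2

ranking = sorted(
    in_cluster, key=lambda f: feature_jsd(in_cluster[f], out_cluster[f]), reverse=True
)
print(ranking)  # 'recency' ranks first: it is what makes this segment unique
```

The same per-feature scores can feed the optimization variant: average (or take the top-k of) them per cluster and blend with Silhouette as in the formula above.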
Interpretability is not exclusive to supervised ML/AI; it is equally vital in unsupervised learning. JSD bridges this gap by quantifying not just how far apart clusters are, but what characterizes them — turning abstract geometric blobs into meaningful, explainable segments.