Learning Hierarchical Clustering with One Example

TeeTracker
4 min read · Aug 31, 2022


Hierarchical clustering is not used all that often. If you use scikit-learn, the syntax is not much different from other models; only the arguments to the constructor change.
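If you only need the black box, here is a minimal scikit-learn sketch (the toy points and the chosen linkage are my own illustration, not part of the example that follows):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two obvious groups of 2-D points, just to show the call signature.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])

# The constructor arguments are where this model differs from most others:
# n_clusters says where to cut the tree, linkage picks the merge criterion.
model = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = model.fit_predict(X)
print(labels)  # e.g. [0 0 1 1]
```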

Although almost all sklearn models are used as black boxes (after all, nobody building a car starts by reinventing the wheel), it is important to find a way to remember uncommon models like this one. After thinking about it, I decided to write this article to illustrate the algorithm with a single example.

The example is about the distances between a few Canadian cities: at each step, the two “closest” clusters are merged. I found it in IBM’s Skills Network, and I’ll lay it out here to illustrate it visually.

Let’s say we want to cluster six cities in Canada based on their distances from one another. They are Toronto (TO), Ottawa (OT), Vancouver (VA), Montreal (MO), Winnipeg (WI), and Edmonton (ED).

We construct a distance matrix at this stage, where the entry in row i, column j is the distance between city i and city j:

dist(i, j)

This table shows the distance between each pair of cities. The algorithm starts by assigning each city to its own cluster, so with six cities we begin with six clusters, each containing just one city. Let’s denote each city by the first two letters of its name.
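Since the original table is an image, here is a stand-in distance matrix in Python. The numbers are my own rough approximations of road distances in km, not the exact values from the IBM example, but they reproduce the same merge order described below:

```python
import numpy as np

# Approximate road distances in km between the six cities.
# Illustrative values, not the exact ones from the IBM table.
cities = ["TO", "OT", "VA", "MO", "WI", "ED"]
D = np.array([
    #  TO    OT    VA    MO    WI    ED
    [   0,  450, 4360,  540, 2230, 3470],  # TO
    [ 450,    0, 4390,  200, 2160, 3250],  # OT
    [4360, 4390,    0, 4580, 2300, 1160],  # VA
    [ 540,  200, 4580,    0, 2300, 3580],  # MO
    [2230, 2160, 2300, 2300,    0, 1300],  # WI
    [3470, 3250, 1160, 3580, 1300,    0],  # ED
], dtype=float)
assert np.allclose(D, D.T)  # dist(i, j) == dist(j, i)
```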

The first step is to determine which cities (let’s call them clusters from now on) to merge. Usually, we take the two closest clusters according to the chosen distance. Looking at the distance matrix, MO and OT are the closest clusters, so we make a cluster out of them.

Notice that we just use a simple one-dimensional distance feature here, but the objects can be multidimensional, and the distance measure can be Euclidean, Pearson correlation, average distance, or many others, depending on the data type and domain knowledge.

Anyhow, we have to merge these two closest cities in the distance matrix as well.

So, rows and columns are merged as the cluster is constructed:

  • As you can see in the distance matrix, the rows and columns for the MO and OT cities collapse into a single row and column for the new [OT/MO] cluster.
  • The distances from all cities to this new merged cluster get updated. But how? For example, how do we calculate the distance from WI to the [OT/MO] cluster? There are different approaches; let’s assume, for example, that we take the distance from the center of the [OT/MO] cluster to WI (centroid linkage). Updating the distance matrix, we now have one less cluster.
  • We look for the closest clusters once again. In this case, [OT/MO] and TO are the closest, which creates another cluster.
  • Next, the closest distance is between the VA and ED clusters. They form a new cluster, and the matrix is updated again; a code sketch of this whole merge loop follows the list.
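To make the merge mechanics concrete, here is a from-scratch sketch of the loop, reusing D and cities from the matrix above. With only a distance matrix we cannot place true centroids, so this version uses average linkage, where the new row is the size-weighted mean of the merged rows:

```python
import numpy as np

def agglomerate(D, labels):
    """Repeatedly merge the two closest clusters until one remains."""
    D = D.copy()
    sizes = [1] * len(labels)
    clusters = [[name] for name in labels]
    while len(clusters) > 1:
        # Put inf on the diagonal so argmin finds the closest *pair*.
        off = D + np.diag([np.inf] * len(clusters))
        i, j = divmod(off.argmin(), len(clusters))
        i, j = min(i, j), max(i, j)
        print(f"merge {clusters[i]} + {clusters[j]} at distance {D[i, j]:.0f}")
        # Average linkage: size-weighted mean of the two merged rows.
        new_row = (sizes[i] * D[i] + sizes[j] * D[j]) / (sizes[i] + sizes[j])
        D[i], D[:, i] = new_row, new_row
        D[i, i] = 0.0
        # Drop row/column j: the matrix shrinks by one each merge.
        D = np.delete(np.delete(D, j, axis=0), j, axis=1)
        clusters[i] += clusters[j]
        sizes[i] += sizes[j]
        del clusters[j], sizes[j]

agglomerate(D, cities)
```

With the stand-in distances, the printout follows exactly the order described above: OT and MO first, then TO joins them, then VA and ED pair up.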

Essentially, the rows and columns are merged as the clusters are merged, and the distances are updated. This is a common way to implement this type of clustering, and it has the benefit of caching distances between clusters. In the same way, the agglomerative algorithm proceeds by merging clusters, and we repeat until all clusters are merged and the tree is complete, i.e. until all six cities end up in a single cluster. Hierarchical clustering is typically visualised as a dendrogram, as shown below.

Dendrogram:

  • Each merge is represented by a horizontal line.
  • The y-coordinate of the horizontal line is the distance (dissimilarity) at which the two clusters were merged, where the individual cities at the bottom are viewed as singleton clusters.
  • By moving up from the bottom layer to the top node, a dendrogram allows us to reconstruct the history of merges that resulted in the depicted clustering.
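If you prefer not to hand-roll the loop, SciPy builds the same merge tree and draws the dendrogram, again reusing the D and cities defined earlier (the axis label is my own choice):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# linkage() expects the condensed (upper-triangle) form of the matrix.
Z = linkage(squareform(D), method="average")

dendrogram(Z, labels=cities)
plt.ylabel("merge distance (km)")
plt.show()
```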

Essentially, hierarchical clustering does not require a prespecified number of clusters. However, in some applications we want a partition into disjoint clusters, just as in flat clustering. In those cases, the hierarchy needs to be cut at some point. For example, cutting the tree here at a specific distance threshold creates three clusters of similar cities.
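Continuing with the SciPy linkage matrix Z from above, cutting the tree is one function call. The 1500 km threshold is my own illustrative choice; with the stand-in distances, it happens to yield three clusters:

```python
from scipy.cluster.hierarchy import fcluster

# criterion="distance" cuts the dendrogram at the given height.
flat = fcluster(Z, t=1500, criterion="distance")
print(dict(zip(cities, flat)))
# e.g. {'TO': 1, 'OT': 1, 'VA': 2, 'MO': 1, 'WI': 3, 'ED': 2}
```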

Hopefully you, the reader (and I), will remember the algorithm from this one example.
