Intuition vs Unsupervised Learning – Agglomerative Clustering in practice

Clustering is a hugely important step of exploratory data analysis and finds plenty of great applications. Typically, clustering technique will identify different groups of observations among your data. For example, if you need to perform market segmentation, cluster analysis will help you with labeling each segment so that you can evaluate each segment’s potential and target some attractive segments. Therefore, your marketing program and positioning strategy rely heavily on the very fundamental step – grouping of your observations and creation of meaningful segments. We may also find many more use cases in computer science, biology, medicine or social science. However, it often turns out to be quite difficult to define properly how a well-separated cluster looks like.

Today, I will discuss some technical aspects of hierarchical cluster analysis, namely Agglomerative Clustering. One great advantage of this hierarchical approach would be fully automatic selection of the appropriate number of clusters. This is because in genuine unsupervised learning problem, we have no idea how many clusters we should look for! Also, in my view, this clever clustering technique solves some ambiguity issues regarding vague definition of a cluster and thus is more than suitable for automatic detection of such structures. On the other hand, the agglomerative clustering process employs standard metrics for clustering quality description. Therefore, it will be fairly easy to observe what is going on.

For the purpose of a brief demonstration, we are using a well-known Iris dataset. We are provided with three different types of observations and four continuous variables:

For now, imagine the iris type label is not available. Suddenly, it is intuitively much more difficult to detect some local structures:

We may be able to separate the observations into two clusters by guesswork. Nevertheless, we should definitely perform better than that. It may be another task for feature engineering to produce variables where all the clusters would be clearly separated, very distant and compact. Here, the question is whether we can automatically cluster these observations in order to obtain meaningful structure. Now, it may be a good time to utilize the agglomerative clustering technique!

In the first step, every observation is perceived as a separate cluster. We then merge two most similar/closest observations/clusters together. The process iteratively merges clusters and observations until we end up with one huge cluster containing all the observations. We are free to choose from several distance metrics (this needs to work in general for comparison of an observation with another observation, an observation with a cluster as well as for comparison of two clusters), but it is wise to use the one which minimizes some overall clustering quality measure. Therefore, we can perform agglomerative clustering using Ward’s linkage which helps to optimize Ward’s criterion (the sum of variances in each variable for each cluster). This criterion is designed to assess the quality of a resulting clustering. Ward’s linkage will merge such observations or clusters that resulting increase of Ward’s criterion is minimized. In our Python example, we use linkage function from scipy.cluster.hierarchy.

In the dendrogram figure above, you can see the observations and clusters merging on the horizontal axis. On the vertical axis, you can observe how distances (based on the selected measure; Ward’s linkage in this case) between the temporary clusters change when merging more and more observations together. Here comes the trick. In order to decide where to draw the line between two different clusters you need to look for INCONSISTENT increase in our metric. What does it mean?

Take a look at the green part of our dendrogram the left. This represents the set of observations forming one cluster. As you merge more and more observations, the increase in metric (vertical axis) is roughly the same. This fact is reflecting the idea that a cluster should be a compact group of observation with rather smooth density. Therefore, we should not encounter any bigger steps until we merge an observation from a different (true) cluster or we merge the two separate clusters. The first inconsistent vertical increase may be detected when the metric suddenly jumps twice as high as anything before:

The question is whether this sudden increase corresponds significantly to appropriate separation of two clusters. We could either stick with the higher level of detail or label everything as one cluster:

We shall take a look at the big picture first. Consider the red part of the dendrogram on the right. Once again, we observe gradually increasing metric with a few bigger steps towards the higher values. Finally, there is a large jump when trying to merge the green part with the red. Therefore, we shall clearly separate those two groups of observations. But what about the further separation? How many clusters should we look for?

The natural and easiest way how to approach this problem would be usage of elbow method with respect to Ward’s criterion optimization – this is why we strategically selected Ward’s criterion in a first place. The global minimum of Ward’s criterion would result in degenerated situation where each observation stands for one cluster. Obviously, that is not the desired result. Therefore, we utilize the elbow method on Ward’s linkage (although it would be more appropriate to show Ward’s criterion directly):

To sum it up, we see that clustering with two clusters decreases Ward’s criterion significantly. Three clusters (situated in the elbow) will also yield big decrease whereas subsequent clustering with higher number of clusters won’t get us much better value. Another solution may be also applied, for example defining inconsistency coefficient of metric increase. Nevertheless, clustering task is subjective by its very nature and each person may perceive different solutions individually. That is why the second step should always be checking the business profile of the resulting clusters and verifying their inner integrity.


Image Credits: Designed by Freepik

1 thought on “Intuition vs Unsupervised Learning – Agglomerative Clustering in practice

  1. Pingback: Distilled News | Data Analytics & R

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.