Analytical Market Segmentation with t-SNE and Clustering Pipeline

Whether the underlying data come from e-shop customers, your clients, small businesses, or large for-profit and non-profit organizations, market segmentation analysis brings valuable insights and helps you turn otherwise hidden information to your advantage, for example into greater sales. It is therefore important to use an efficient analytical pipeline that not only helps you understand your customer base but also serves you when planning tailored offers, advertising, promos or strategy. Let us play with some advanced analytics to provide a simple example of how segmentation efficiency can be improved using clustering, projection pursuit and t-SNE.

As your goal might be improving your sales through tailored customer contact, you need to discover homogeneous groups of people. Different groups of customers behave and respond differently, so it is only natural to treat them differently. The idea is to increase profit in each segment separately, through a strategy tailored to that segment. Thus, we need to accomplish two fundamental tasks:

  1. identify homogeneous market segments (i.e. which people are in which group)
  2. identify important features (i.e. what is decisive for customer behavior)

In this post, I focus on the first problem from a technical point of view, using some advanced analytical methods. For the sake of a brief demonstration, I will work with a simple dataset describing the annual spending of a wholesale distributor's clients on various product categories. As the figure below shows, it would be difficult to detect well-separated clusters of clients at first sight.

scatter_all

This difficulty arises partly from the fact that our data are multidimensional, and relevant structures may not be obvious when viewed in only two dimensions. A lot of manual effort can easily be wasted on analyzing all the combinations of potential 2D or even 3D scatter plots. Luckily, unsupervised machine learning offers methods to deal with this particular problem.

On the one hand, I would like to mention well-established approaches such as clustering and projection pursuit. On the other hand, we should also take a look at cutting-edge visualization techniques such as t-SNE. All of these methods may contribute greatly to efficient segmentation and, subsequently, to increased profits.

In our dataset, we are also provided with a binary variable indicating the sales Channel (Hotel/restaurant/cafe vs Retail). Imagine we are not provided with this target variable. Still, we would like to divide our customers into two large segments, as the channel clearly has some causal connection with spending in those categories. Let us briefly compare (just) three selected methods and see which approach discovers the most of the hidden information about the channel.

First, we perform Principal Component Analysis (PCA) to reduce the dimensionality of the feature space. The picture below shows the projection of our scaled data onto the first two principal components, i.e. the two directions with the highest variance. However, this was not very useful, because we are still unable to detect any well-separated major structures. If we did not possess any information about the channel label (in the middle), we would not be able to draw a line separating the green and blue dots (on the left). Nevertheless, PCA gives us the components of maximal variance, which retain as much of the data's variance as possible for a given number of dimensions, and that is why this dimensionality reduction technique can be useful in general. Here is the result:

pca_vs_kmeans
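The scaling and PCA projection described above can be sketched roughly as follows. This is only a minimal illustration, assuming scikit-learn; the data are a synthetic stand-in (not the wholesale dataset), and the variable names are mine.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the spending table (NOT the real wholesale data):
# 200 clients, 6 right-skewed spending categories.
rng = np.random.default_rng(0)
X = rng.lognormal(mean=8.0, sigma=1.0, size=(200, 6))

# Scale every feature to zero mean and unit variance, then project
# onto the two directions of highest variance.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape)                    # one 2D point per client
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```

The explained variance ratio is a quick way to see how much of the overall variance the 2D picture actually retains.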

Secondly, we applied a very simple k-means clustering algorithm to our multidimensional data in an attempt to discover a hidden pattern (on the right). It did a pretty sound job once we manually set the number of clusters to two. The centroid of each cluster roughly corresponds to the mean of the respective channel. However, only about 77 % of all clients are assigned to their true channel segment. (Of course, we would not be able to measure this in practice, where the channel label is unknown.) That may seem like quite a good baseline, but keep in mind that the better the segmentation efficiency, the higher the potential profit from differentiated strategies, because you hit the right target more often. How can we improve our clustering efficiency then?
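To make the k-means step and the matching efficiency more concrete, here is a small sketch. It assumes scikit-learn and uses synthetic two-group data in place of the real clients; the helper matching_efficiency is a hypothetical name of mine, and the exact way the efficiency was computed for the figures is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler


def matching_efficiency(true_labels, cluster_labels):
    """Share of points whose cluster matches the true binary label.

    Cluster ids are arbitrary, so both ways of pairing the two clusters
    with the two labels are considered and the better one is kept.
    """
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    agreement = np.mean(true_labels == cluster_labels)
    return max(agreement, 1.0 - agreement)


# Synthetic stand-in: two overlapping groups play the role of the channels.
X, channel = make_blobs(n_samples=300, n_features=6, centers=2,
                        cluster_std=3.0, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# The number of clusters is set to two by hand, as in the text.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
clusters = kmeans.fit_predict(X_scaled)

efficiency = matching_efficiency(channel, clusters)
print(efficiency)
```

With two clusters the efficiency can never drop below 0.5, so values should be read against that floor rather than against zero.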

We use a modern and very popular visualization method called t-Distributed Stochastic Neighbor Embedding (t-SNE). It is very good at finding structures and relevant connections in high-dimensional data. t-SNE maps the dataset onto a two-dimensional plane, which is of course well suited to visualization, and it places clients that are similar in feature space close together on the final 2D map. Although we cannot draw any feature-based conclusions from the final 2D map, we can detect local structures, i.e. client segments. Why not create a pipeline that chains multiple clustering algorithms and dimensionality reduction techniques together? The figure below shows the result of k-means clustering applied to the two-dimensional t-SNE map. The matching efficiency rose from about 77 % to about 90 % of all clients correctly assigned to their true channel segment. And all of this without any tuning of the methods used!

tsne_kmeans
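The chaining of t-SNE and k-means could look roughly like the sketch below. It is not my actual pipeline class, just a plain function under stated assumptions: scikit-learn, synthetic data, and a hypothetical function name. A plain function is used because scikit-learn's TSNE only offers fit_transform and cannot map new points with a separate transform call.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler


def tsne_kmeans_segments(X, n_segments=2, seed=0):
    """Scale the data, embed it in 2D with t-SNE, then cluster the embedding."""
    X_scaled = StandardScaler().fit_transform(X)
    embedding = TSNE(n_components=2, random_state=seed).fit_transform(X_scaled)
    labels = KMeans(n_clusters=n_segments, n_init=10,
                    random_state=seed).fit_predict(embedding)
    return embedding, labels


# Synthetic stand-in for the client data (NOT the wholesale dataset).
X, _ = make_blobs(n_samples=150, n_features=6, centers=2, random_state=0)
embedding, segments = tsne_kmeans_segments(X)

print(embedding.shape)  # 2D map coordinates, one row per client
print(set(segments))    # segment ids found by k-means
```

Note that k-means here sees only the 2D map, not the original features, which is exactly what distinguishes this step from the plain k-means baseline.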

t-SNE has a non-convex objective function, which is minimized by gradient descent with random initialization. Thus, we want to run t-SNE several times and choose the mapping with the lowest value of the objective function, i.e. the Kullback-Leibler divergence between the joint probabilities of the high-dimensional data and those of the low-dimensional embedding. In terms of clustering, this may yield the most efficient segmentation. We briefly verify both this trend and the consistency of our maximal achieved efficiency of around 90 % by running the whole pipeline several hundred times:

eff
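A scaled-down version of such a repeated experiment might look like the following sketch. It assumes scikit-learn and synthetic data, uses only five runs instead of several hundred, and records the KL divergence together with the matching efficiency of each run; it illustrates the idea and is not the code behind the figure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic stand-in: two overlapping groups play the role of the channels.
X, channel = make_blobs(n_samples=120, n_features=6, centers=2,
                        cluster_std=4.0, random_state=1)

results = []  # one (kl_divergence, efficiency) pair per run
for seed in range(5):
    # Random initialization, so every seed gives a different embedding.
    tsne = TSNE(n_components=2, init="random", random_state=seed)
    embedding = tsne.fit_transform(X)
    clusters = KMeans(n_clusters=2, n_init=10,
                      random_state=seed).fit_predict(embedding)
    # Cluster ids are arbitrary, so take the better of the two pairings.
    agreement = np.mean(clusters == channel)
    efficiency = max(agreement, 1.0 - agreement)
    results.append((float(tsne.kl_divergence_), float(efficiency)))

# Tuples compare by their first element, so this is the lowest-KL run.
best_kl, best_eff = min(results)
print(best_kl, best_eff)
```

Plotting the collected pairs against each other is one simple way to check whether lower KL divergence tends to go with higher efficiency on your own data.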

To sum up, it is definitely worth using different clustering methods to perform analytical market segmentation. What is more, we have shown an example of how you can boost clustering efficiency by chaining several methods into a pipeline. Remember, we have done this because the more accurate the segmentation, the higher the potential profit in each segment!

Stay tuned for the online aLook Analytics Segmentation Demo, coming soon!

4 thoughts on “Analytical Market Segmentation with t-SNE and Clustering Pipeline”

  1. Dhruv

    How do you handle a large data set when running t-SNE in your pipeline? Scikit’s implementation of t-SNE crashes on my large training dataset.

    1. petrbour Post author

      I highly recommend checking the t-SNE implementation in scikit-learn.
      It may be a good idea to run the t-SNE algorithm several times in order to optimize the KL divergence metric. In my implementation, the whole process is governed by a class. For the t-SNE projection methods, you may use something simple like this:

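      import random

      from sklearn.manifold import TSNE

      # Both functions are methods of the class that stores the data in self.X.
      def _tsne(self, dim=2, rnd_seed=12345, num_proj_opt=10):
          """Return the t-SNE projection with the lowest KL divergence found."""
          tsne = TSNE(n_components=dim, random_state=rnd_seed)
          best_params = self._metric_optimization(est=tsne,
                                                  metric='kl_divergence_',
                                                  num_attempts=num_proj_opt,
                                                  rnd_seed=rnd_seed)
          # Refit with the best seed; a fixed random_state should reproduce that run.
          tsne.set_params(**best_params)
          projection = tsne.fit_transform(self.X)
          return projection

      def _metric_optimization(self, est, metric, num_attempts=10, rnd_seed=12345):
          """Refit est with random seeds; return the params that minimize metric."""
          best_params = est.fit(self.X).get_params()
          best_val = getattr(est, metric)
          random.seed(rnd_seed)
          attempts = random.sample(range(10000), k=num_attempts)

          for seed in attempts:
              est.set_params(random_state=seed)
              est.fit(self.X)
              current_val = getattr(est, metric)
              if current_val < best_val:
                  # Track the best value too, so later seeds compare against it.
                  best_val = current_val
                  best_params = est.get_params()
          return best_params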
      

      Multidimensional data are stored as self.X. The desired t-SNE output is projection. And as you may notice, the method _metric_optimization is rather generic.
