Imagine a storekeeper who keeps a record of all his customers’ purchase histories. This allows him to look up the type of products an individual buyer might be interested in. However, doing this for each individual is grossly inefficient. A better solution would be to categorize his customers into groups, with each group having similar preferences. This would allow him to reach out to more customers with each product recommendation.
The problem is, the storekeeper does not know 1) how his customers should be categorized, nor 2) how many of such categories exist. To answer these questions, we can use clustering.
k-means clustering is a technique used to uncover categories. In the retail sector, it can be used to categorize both products and customers. k represents the number of categories identified, with each category’s average (mean) characteristics being appreciably different from that of other categories.
In the storekeeper’s case, we can uncover categories of customers by combining personal information with purchase histories. This would allow us to group customers of similar backgrounds, who tend to buy similar products.
To illustrate this, we use an actual dataset of Facebook user personalities and ‘likes’. This data was obtained from Facebook users who completed a short personality questionnaire and provided information on which movie pages they had ‘liked’. In place of personality scores, we may also use demographic information such as age or household income.
Based on experience, we may have a hunch that different movie genres appeal to people of different personalities. To confirm this, we can construct a plot of movie titles along personality dimensions:
From initial inspection, there appears to be three clusters:
- Red: Conscientious extraverts who like action and romance genres
- Blue: Anxious and open people who like avant-garde and fantasy genres
- Yellow: Introverts with social anxieties who like Japanese animations (otaku culture)
- Movies in the center seem to be general household favorites.
With this information, the storekeeper can now be more certain of recommending products to interested customers. For instance, if a customer bought a DVD of 27 Dresses, the storekeeper could deduce that his customer is likely to be conscientious or extraverted, and might also be interested in another movie in the same cluster, such as 50 First Dates. Besides product recommendation, such clusters also allow the storekeeper to bundle similar products for effective discounts.
After seeing how clusters can be used, we will now examine how the technique works. Recall that clustering solves two problems:
- Determining the number of categories that exist
- Determining the members of each category
One way to find out the number of categories is by visual inspection, as with the above plot of movie titles.
Another way is to use something called a scree plot:
This plot shows how within-cluster scatter (i.e. how dispersed a cluster is) decreases as the number of clusters increase. With more clusters, cluster members can be closer to cluster centers, forming more compact clusters. Conversely, if there is only one cluster to which all members belong to, within-cluster scatter would be at its maximum.
The choice of number of clusters is based on the principle of diminishing marginal returns. If there are too many small clusters, the result may be overly-complex, such that categories are not generalizable to new products or customers. Hence, the scree plot reveals a “kink” at which the number of clusters derived can reduce within-cluster scatter to a reasonable degree, beyond which having any more clusters would yield smaller and yet smaller clusters. These smaller clusters come at an increasing cost of results being more complex and less generalizable.
After determining the number of clusters, we can then determine cluster membership. This involves a simple iterative process. We will illustrate this process with a 2-cluster example:
Step 4: Repeat the steps of re-assigning cluster members (Step 2) and re-locating cluster centers (Step 3), until there are no more changes to cluster membership. See the full iterative process in the animation below:
These 4 steps wrap up the process of determining cluster membership. The same process is used for 3 or more clusters. In addition, while the above example only shows data points varying along 2 dimensions (as 2-dimensional plots are easier to visualize), clustering can also be done for 3 or more dimensions. In other words, the storekeeper could cluster his customers by combining multiple sources of information in addition to their personality, such as their age, income and how far their home is from his store.
Apart from the retail sector, clustering is used in a wide range of fields. For example, clustering can help to identify biological genotypes, and to pinpoint hot spots of criminal activity.
While k-means clustering is a useful tool, it is not without limitations:
- Each data point can only be assigned to one cluster. Sometimes a data point might be in the middle of 2 clusters, having an equal chance of being assigned to either, but then a small shift might superfluously nudge it over to one of them. Hence, a more robust solution might include probability values, indicating how likely each data point might belong to each cluster.
- Clusters are assumed to be spherical. The distance from a cluster center to its furthest data point is akin to the cluster’s radius, and the iterative process of finding data points closest to the cluster center is akin to narrowing the cluster’s radius, so that resulting clusters are compact spheres. This might pose a problem if the shape of an actual cluster is, for instance, an ellipse. An elongated cluster might be truncated, and its members subsumed into a nearby cluster.
- Clusters are assumed to be discrete. k-means clustering does not permit clusters to overlap, nor to be nested within each other.
While there are alternative clustering algorithms that overcome these limitations, the strength of k-means clustering algorithm lies in its elegant simplicity. A good strategy might be to start with k-means clustering to get a basic understanding of the data structure, before diving into more advanced methods to examine areas where k-means clustering falls short.
R script used to simulate iterative process downloadable here.
Did you learn something useful today? We would be glad to inform you when we have new tutorials, so that your learning continues!
Sign up below to get bite-sized tutorials delivered to your inbox:
Copyright © 2015-Present Algobeans.com. All rights reserved. Be a cool bean.