What is it?
In Machine Learning, clustering is an Unsupervised Learning task that learns the patterns and structure of a dataset by grouping similar data points together. The resulting groups are called clusters.
Use cases
Clustering is a flexible task which can be used either as the only model in a project or as a preprocessing step for a later model.
-
Segmentation
When there is a need to discover distinct segments or groups in a dataset, one can apply clustering and use the resulting clusters in any later analysis, e.g. cluster customers and show different ads to each segment.
-
Data analysis
One could also apply clustering to the whole dataset and then analyze each cluster separately, using a different approach for each one depending on the objective.
-
Dimensionality Reduction
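Datasets with too many features can be reduced using clustering techniques, for example by replacing each data point's original features with its distances to the cluster centroids. This process is called Dimensionality Reduction.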
-
Outlier / anomaly detection
Any data point which has little similarity to the others, or doesn't properly fit into any cluster, can be considered an outlier. This method is commonly used to detect fraudulent instances.
-
Semi-supervised learning
When a dataset is badly labeled, clustering can supplement another ML algorithm with labeled data, using the cluster assignments as the labels. This technique is a form of Semi-Supervised Learning.
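As a rough illustration of the segmentation and outlier use cases above, the sketch below assigns each data point to its nearest centroid and flags points that sit far from every centroid. The centroids, the sample points, the threshold, and the helper names `nearest_centroid` and `flag_outliers` are all assumptions made for this example; in practice the centroids would come from an earlier clustering step and the threshold would need tuning for the dataset.

```python
import math


def nearest_centroid(point, centroids):
    """Return (index, distance) of the centroid closest to `point`."""
    distances = [math.dist(point, c) for c in centroids]
    best = min(range(len(centroids)), key=distances.__getitem__)
    return best, distances[best]


def flag_outliers(points, centroids, threshold):
    """Return the points farther than `threshold` from every centroid."""
    return [p for p in points if nearest_centroid(p, centroids)[1] > threshold]


# Hypothetical centroids, assumed to come from an earlier clustering step.
centroids = [(1.0, 1.0), (8.0, 8.0)]
points = [(1.2, 0.9), (0.8, 1.1), (7.9, 8.2), (8.1, 7.8), (4.5, 15.0)]

# Segmentation: the index of the nearest centroid acts as the segment id.
segments = [nearest_centroid(p, centroids)[0] for p in points]

# Outlier detection: the threshold of 2.0 is an arbitrary choice for this data.
outliers = flag_outliers(points, centroids, threshold=2.0)
```

Here the last point still gets a segment id, because nearest-centroid assignment always picks some cluster, but it is also the only point reported as an outlier. Whether such a point should stay in its segment or be handled separately depends on the objective.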
Types of clustering
There are four main categories of clustering algorithms, which differ in the method used to form the clusters:
-
Centroid-based clustering
Each cluster is represented by a central reference vector, the centroid, which may not be part of the original dataset. Each data point is assigned to the cluster with the nearest centroid. e.g. K-Means.
-
Hierarchical clustering
Data points are more similar to nearby data points than to data points which are further away. A cluster is defined by the maximum distance needed to connect its data points. e.g. BIRCH and Agglomerative Clustering.
-
Distribution-based clustering
Built on statistical distribution models: data points are clustered together if they most likely belong to the same distribution. These tend to be complex models which can easily overfit. e.g. Gaussian Mixture Clustering.
-
Density-based clustering
Clusters are formed from areas of high density. Data points in sparse areas are considered noise or outliers, while border points lie on the edge of a dense area and still belong to its cluster. e.g. DBSCAN.
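To make the centroid-based category more concrete, here is a minimal sketch of K-Means (Lloyd's algorithm) in plain Python. It is only an illustration under simplifying assumptions: the initial centroids are passed in by hand rather than chosen randomly, the distance is Euclidean, and the function name `k_means` and the sample data are invented for this example. A real project would more likely use a library implementation.

```python
import math


def k_means(points, initial_centroids, max_iterations=100):
    """Alternate between assigning points and updating centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    return labels, centroids


# Two well-separated groups of made-up 2D points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
labels, centroids = k_means(points, initial_centroids=[(0.0, 0.0), (10.0, 10.0)])
```

On this data the first three points end up in one cluster and the last three in the other. Each final centroid is the mean of its members, so neither centroid coincides with a point from the dataset, which matches the description of the centroid as a reference vector that may not be part of the original data.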