Introduction to Clustering
Machine learning models fall into two main categories: supervised and unsupervised learning. In simple terms, if we have a target variable, the problem is called supervised learning; if the target variable is not available, we have unsupervised learning.
One of the unsupervised learning methods is clustering. In this article, we review different clustering approaches.
Clustering
The history of clustering goes back to when humanity first needed to describe the characteristics of different things; the earliest formal studies in this area date to the 1950s.
Clustering has been applied in many areas, including mathematics, statistics, biology, genetics, and beyond.
In general, clustering is used for one of the following reasons:
- Finding hidden relations in the data structure to gain insight, form hypotheses, identify unusual behavior in the data, and extract salient features.
- Grouping items based on their natural similarities and measuring the degree of similarity among each cluster's members.
- Compressing the data so it can be managed and summarized through cluster representatives.
Similarity and distance measures
Since clustering groups different items, we need metrics to measure the similarity or dissimilarity between them. Two well-known kinds of metrics are used in this field: similarity and distance.
Most clustering methods use distance: the smaller the distance between two items, the more similar they are.
Similarity metrics, on the other hand, are computed between two vectors directly, and a higher value indicates higher similarity.
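As a small illustration, the sketch below computes a distance metric (Euclidean) and a similarity metric (cosine) between two vectors with NumPy; the vectors are made-up values used only for this example.

```python
import numpy as np

# Two example feature vectors (hypothetical values)
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 4.0])

# Distance metric: a smaller value means the items are more similar
euclidean_distance = np.linalg.norm(a - b)

# Similarity metric: a larger value means the vectors are more similar
cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean_distance, cosine_similarity)
```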
Clustering approaches
In the following, I list the main approaches to clustering; each approach encompasses several techniques.
- Hierarchical clustering
- Partitioning method
- Density-based methods
- Model-Based methods
- Grid-Based methods
In the following, each approach is introduced briefly.
Hierarchical clustering
In the hierarchical strategy, clusters are built by a sequential grouping process (top-down or bottom-up); it includes the following methods.
- Agglomerative
Each data point starts as its own singleton cluster, and in each forward step the most similar clusters are merged until the desired cluster structure is formed. This algorithm uses similarity as the clustering metric (see the sketch after this list).
- Divisive clustering
In this method, unlike the previous one, all data points start in a single cluster, and in the subsequent steps (iterations) the cluster is split into sub-clusters.
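As a minimal sketch of the agglomerative (bottom-up) approach, the example below uses scikit-learn's AgglomerativeClustering on a few toy points; the data and the choice of two clusters are assumptions made only for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy 2-D data: two loose groups of points (illustrative values)
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
              [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]])

# Bottom-up merging: each point starts as its own cluster, and the
# closest clusters are merged until two clusters remain.
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1]
```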
Partitioning method
In this approach, cluster assignments change based on minimizing a distance criterion. It needs an initial solution to start from, and the number of clusters is defined as a model hyper-parameter. To guarantee the optimal solution, which corresponds to the ideal allocation, all possible partitions would have to be considered, which is infeasible in practice. Partitioning includes the following methods.
- Sum of squared errors
This is one of the most common clustering criteria and works best for compact, well-separated clusters. The idea is to find the cluster structure that minimizes the sum of squared errors (SSE).
- Graph-based clustering
It represents the data points as nodes of a graph whose edges encode the similarity between points, and finds clusters as strongly connected groups of nodes, as sketched below.
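Spectral clustering is one widely used graph-based method: it builds a similarity graph over the points and partitions it. The sketch below uses scikit-learn's SpectralClustering with an RBF similarity on toy data; the data and parameter choices are illustrative assumptions only.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Toy 2-D data: two compact groups of points (illustrative values)
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [3.0, 3.0], [3.1, 3.2], [2.9, 3.1]])

# Build a fully connected similarity graph (RBF kernel) and cut it
# into two groups of strongly connected nodes.
model = SpectralClustering(n_clusters=2, affinity="rbf", random_state=0)
labels = model.fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1]
```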
Density-based methods
Clusters are defined as high-density regions: these algorithms try to identify connected regions of high density, separated by sparser areas.
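DBSCAN is a well-known density-based algorithm; the sketch below runs it with scikit-learn on toy data. The `eps` and `min_samples` values here are arbitrary assumptions for the illustration and would normally be tuned to the data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy 2-D data: two dense groups plus one isolated point (illustrative values)
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
              [4.0, 4.0], [4.1, 4.1], [4.0, 4.2],
              [9.0, 9.0]])

# Points with enough neighbours within eps form dense, connected regions;
# points that belong to no dense region are labelled -1 (noise).
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1 -1]
```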
Grid-Based methods
These methods impose a grid on the problem space and apply clustering to the resulting cells.
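As a rough sketch of the grid-based idea, the code below bins points into fixed-size cells with NumPy and keeps only cells whose point count exceeds a threshold; real grid-based algorithms (such as STING or CLIQUE) are considerably more elaborate, and the cell size and threshold here are assumptions.

```python
import numpy as np

# Toy 2-D data (illustrative values)
X = np.array([[0.1, 0.2], [0.3, 0.1], [0.2, 0.4],
              [2.1, 2.2], [2.3, 2.1], [7.5, 0.3]])

cell_size = 1.0   # assumed grid resolution
min_points = 2    # assumed density threshold per cell

# Map each point to the index of the grid cell it falls into
cells = np.floor(X / cell_size).astype(int)

# Count points per cell and keep the "dense" cells as cluster seeds
unique_cells, counts = np.unique(cells, axis=0, return_counts=True)
dense_cells = unique_cells[counts >= min_points]
print(dense_cells)  # the two populated cells survive; the lone point's cell does not
```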
Evaluation criteria
One of the essential subjects in clustering is the quality of the resulting clusters. Several criteria have been developed for this purpose, and they fall into two groups: internal evaluation and external evaluation.
Internal evaluation
This group measures the compactness of the clusters using the similarity (or distance) metric itself. These criteria examine how close the members of each cluster are to one another (intra-cluster homogeneity), how different the members of distinct clusters are (inter-cluster separability), or both. The sum of squared errors (SSE), used in the next section, is one of these criteria.
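As an illustrative sketch, the code below computes two internal criteria for a given clustering: SSE (within-cluster sum of squared errors) by hand, and the silhouette score from scikit-learn. The data and labels are assumed for the example.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Assumed toy data and a candidate clustering to evaluate
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])

# SSE: sum of squared distances of points to their cluster mean (compactness)
sse = sum(np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
          for k in np.unique(labels))

# Silhouette: combines intra-cluster homogeneity and inter-cluster separability
sil = silhouette_score(X, labels)
print(sse, sil)
```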
K-means
In the following, I introduce the math behind the K-Means clustering algorithm. Assume that we have $Y = \{y_1, y_2, \dots, y_m\}$ with $d$ dimensions, which needs to be clustered into $K$ clusters $C = \{C_1, C_2, \dots, C_K\}$. The K-Means algorithm tries to minimize the within-cluster sum-of-squares. If we consider $\mu_k$ as the mean of cluster $C_k$, then the within-cluster sum-of-squares is

$$\mathrm{SSE}(C) = \sum_{k=1}^{K} \sum_{y_i \in C_k} \lVert y_i - \mu_k \rVert^2 .$$
Minimizing this objective exactly is an NP-hard problem; K-Means is a greedy algorithm and only reaches a local optimum. But if the clusters are well separated, that local optimum is probably close to the global one.
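The sketch below is a minimal NumPy implementation of the standard K-Means (Lloyd's) iteration described above; the random initialization, iteration count, and toy data are simplifying assumptions, not a production implementation.

```python
import numpy as np

def kmeans(Y, K, n_iters=100, seed=0):
    """Minimal K-Means: alternate assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    # Initialize the K means with randomly chosen data points (simple init)
    mu = Y[rng.choice(len(Y), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins the cluster with the nearest mean
        dists = np.linalg.norm(Y[:, None, :] - mu[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each mean becomes the centroid of its assigned points
        new_mu = np.array([Y[labels == k].mean(axis=0) if np.any(labels == k)
                           else mu[k] for k in range(K)])
        if np.allclose(new_mu, mu):  # converged to a local optimum
            break
        mu = new_mu
    # Final assignment and within-cluster sum of squares (SSE)
    labels = np.linalg.norm(Y[:, None, :] - mu[None, :, :], axis=2).argmin(axis=1)
    sse = np.sum((Y - mu[labels]) ** 2)
    return labels, mu, sse

# Toy data: two well-separated groups (illustrative values)
Y = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [6.0, 6.1], [6.2, 5.9], [5.9, 6.0]])
labels, mu, sse = kmeans(Y, K=2)
print(labels, sse)
```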
Conclusion
In this review, I introduced different approaches to clustering and their differences. Moreover, we reviewed the criteria for evaluating clustering quality, and as an example, we briefly introduced the K-Means model.