Introduction to Clustering

AI Maverick
Jun 20, 2022 · 4 min read


Machine learning models fall into two main categories: supervised and unsupervised learning. In simple words, if we have a target variable, the problem is called supervised learning; in contrast, if the target variable is not available, we have unsupervised learning.

One of the unsupervised learning methods is clustering. In this article, we review different clustering approaches.

Clustering

The history of clustering goes back to the days when humanity needed to describe the characteristics of different things. The early studies in this area date back to the 1950s.

Clustering has been used in different areas, including mathematics, statistics, biology, genetics, and the broader scientific literature.

In general, clustering is used for one of the following reasons:

  • Finding hidden relations in the data structure to gain insight from the data, form hypotheses, identify unusual behavior, and extract the most prominent features.
  • Classifying items based on realistic similarities and measuring the degree of similarity among a cluster’s members.
  • Compressing the data so they can be managed and summarized by a representative of each cluster.

Similarity and distance measures

As we use clustering to group different items, we need metrics to measure the similarity or dissimilarity between the cases. There are two well-known kinds of metrics in this field: similarity and distance.

Most clustering methods use a distance to measure how alike two items are: the lower the distance, the more similar the items.

On the other hand, a similarity metric can be computed between two vectors; a higher value of this metric indicates higher similarity.
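As a quick illustration, both kinds of metrics can be computed for two vectors with NumPy. This is only a minimal sketch; the vectors `a` and `b` are made-up examples.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 4.0])

# Distance metric: a lower value means the items are more alike.
euclidean_distance = np.linalg.norm(a - b)

# Similarity metric: a higher value means the items are more alike.
cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean_distance)  # ~1.732
print(cosine_similarity)   # ~0.933
```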

Clustering approaches

In the following, I list the main approaches to clustering, each covering several techniques.

  • Hierarchical clustering
  • Partitioning method
  • Density-based methods
  • Model-Based methods
  • Grid-Based methods

In the following, each approach is introduced briefly.

Hierarchical clustering

In the hierarchical strategy, clusters are formed by a sequential grouping procedure (top-down or bottom-up) and include the following methods.

  • Agglomerative

Initially, each data point represents one cluster and is its only member; the closest clusters are then merged step by step until the proper cluster structure forms (a short sketch follows after this list). This algorithm uses similarity as the clustering metric.

  • Divisive clustering

In this method, unlike the previous one, all the data points are first allocated to a single cluster, and in the following steps (iterations), the cluster is divided into sub-clusters.
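Here is a minimal sketch of the agglomerative (bottom-up) variant, assuming scikit-learn is available; the small 2-D dataset is hypothetical.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical 2-D data: two loose groups of points.
X = np.array([[1, 1], [1, 2], [2, 1],
              [8, 8], [8, 9], [9, 8]])

# Start with every point as its own cluster and merge the closest
# pairs until only two clusters remain.
model = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = model.fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1] (cluster ids may be swapped)
```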

Partitioning method

In this scenario, points are reassigned between clusters so as to minimize a distance-based objective. The method needs an initial assignment as the first solution, and the number of clusters is defined as a model hyper-parameter. To guarantee the optimal solution, i.e., the ideal allocation, all possible assignments would have to be considered. Partitioning includes the following methods.

  • Sum of squared errors

This is one of the common methods of clustering and works better for compact and well-separated clusters. The idea is to find the cluster structure that minimizes the sum of squared errors (SSE).

  • Graph-based clustering

It represents the data points as nodes of a graph and uses the connections (edges) between them to form the clusters.
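The article does not name a specific graph-based algorithm; spectral clustering is one common example of this family, and the sketch below assumes scikit-learn and made-up data.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Hypothetical data: two well-separated groups of points.
rng = np.random.default_rng(0)
group_a = rng.normal(0, 0.3, size=(50, 2))
group_b = rng.normal(0, 0.3, size=(50, 2)) + np.array([4.0, 0.0])
X = np.vstack([group_a, group_b])

# Build a nearest-neighbor graph over the points and cut it into 2 clusters.
model = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                           n_neighbors=10, random_state=0)
labels = model.fit_predict(X)
print(labels[:5], labels[-5:])
```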

Density-based methods

The clusters are defined as high-density zones. These algorithms try to identify connected zones with the highest density.
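The article does not name a particular algorithm here; DBSCAN is a widely used density-based method, and the following is a minimal sketch assuming scikit-learn and hypothetical data and parameters.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: two dense groups plus a few scattered outliers.
rng = np.random.default_rng(1)
dense_a = rng.normal(0, 0.2, size=(40, 2))
dense_b = rng.normal(5, 0.2, size=(40, 2))
noise = rng.uniform(-2, 7, size=(5, 2))
X = np.vstack([dense_a, dense_b, noise])

# eps: neighborhood radius; min_samples: points needed to call a zone "dense".
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(set(labels))  # clusters 0 and 1, plus -1 for low-density outliers
```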

Grid-Based methods

It generates a grid over the problem space and applies the clustering method to each cell.
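As an illustration of the grid idea only, here is a minimal sketch (not any specific published algorithm) that bins points into cells, keeps the dense cells, and merges adjacent dense cells into clusters; the bin count and density threshold are made-up values.

```python
import numpy as np
from scipy import ndimage

# Hypothetical 2-D points forming two groups.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, size=(60, 2)),
               rng.normal(4, 0.3, size=(60, 2))])

# 1. Lay a grid over the problem space and count points per cell.
counts, x_edges, y_edges = np.histogram2d(X[:, 0], X[:, 1], bins=20)

# 2. Keep only cells that are dense enough, then merge adjacent
#    dense cells into clusters (connected components on the grid).
dense = counts >= 3
cell_labels, n_clusters = ndimage.label(dense)
print(n_clusters)  # number of grid-level clusters found
```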

Evaluation criteria

One of the essential subjects in clustering is the quality of the clusters. Several criteria have been developed for evaluating clusterings, and they are divided into two groups: internal evaluation and external evaluation.

Internal evaluation

This group analyzes the compactness of the clusters with a similarity metric. Such a criterion examines how close a cluster’s members are to each other (intra-cluster homogeneity), how different the members of two clusters are (inter-cluster separability), or both. The Sum of Squared Errors (SSE), used below, is one of these criteria.
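For a concrete illustration, the sketch below (assuming scikit-learn and made-up data) computes two internal criteria: the SSE of a K-Means fit, and the silhouette score, another common internal measure that combines homogeneity and separability; the latter is not mentioned in the article and is included only as an example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data with two clear groups.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(4, 0.3, size=(50, 2))])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# SSE (within-cluster sum of squares): lower means more compact clusters.
print(model.inertia_)

# Silhouette: combines intra-cluster homogeneity and inter-cluster
# separability in one score between -1 and 1 (higher is better).
print(silhouette_score(X, model.labels_))
```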

K-means

In the following, I introduce the math behind the K-Means clustering algorithm. Assume that we have Y = {y1, y2, …, ym} with d dimensions, which needs to be clustered into K clusters C = {C1, C2, …, CK}. The K-Means algorithm tries to minimize the within-cluster sum of squares. If we denote by $\mu_k$ the mean of cluster $C_k$, then the within-cluster sum of squares is

$$\mathrm{SSE} = \sum_{k=1}^{K} \sum_{y_i \in C_k} \lVert y_i - \mu_k \rVert^2$$

This optimization problem is NP-hard; K-Means is a greedy algorithm, so it only reaches a local optimum. But if the clusters are well separated, the local optimum is likely to be close to the global one.
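Here is a minimal NumPy sketch of the standard iterative procedure (Lloyd’s algorithm), which alternates between assigning points to the nearest mean and recomputing the means; the data and parameters are hypothetical.

```python
import numpy as np

def kmeans(Y, k, n_iter=100, seed=0):
    """Minimal K-Means (Lloyd's algorithm): alternately assign points
    to the nearest cluster mean and recompute each mean."""
    rng = np.random.default_rng(seed)
    # Initialize the K means with randomly chosen data points.
    mu = Y[rng.choice(len(Y), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        dists = np.linalg.norm(Y[:, None, :] - mu[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each mean moves to the centroid of its members.
        new_mu = np.array([Y[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    # Within-cluster sum of squares (the SSE objective above).
    sse = ((Y - mu[labels]) ** 2).sum()
    return labels, mu, sse

# Hypothetical data: two groups in 2-D.
rng = np.random.default_rng(4)
Y = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(4, 0.3, size=(50, 2))])
labels, mu, sse = kmeans(Y, k=2)
print(sse)
```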

Conclusion

In this review, I introduced different approaches to clustering and their differences. Moreover, we reviewed evaluation metrics and criteria for the algorithms and, as an example, briefly introduced the K-Means model.
