
Machine learning basics (part 9): Clustering

5 min read · May 17, 2022

Cluster analysis is the grouping of individuals, elements, or cases in a population or dataset in order to discover structure in the data. We would like the cases within a group to be close or similar to one another, but dissimilar from cases in other groups.

Clustering is fundamentally a collection of data exploration methods. These are unsupervised learning algorithms, but they can also be applied in a supervised way when class label information exists. When they are used in the usual unsupervised way, the generated groups (if any emerge), called clusters, are named and their "meaning" is defined by a user. Clusters can then be employed as classes for further supervised machine learning. Clustering produces a partition of the dataset: a division into mutually non-overlapping groups.

To compute clusters, we apply distance or similarity measures to decide which cases are assigned to each cluster and how clusters are formed. Dividing N data cases into K non-empty clusters (groups) gives a huge number of possible partitions, which is expressed by the Stirling number of the second kind:

$$S(N, K) = \frac{1}{K!} \sum_{i=0}^{K} (-1)^{i} \binom{K}{i} (K - i)^{N}$$
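To get a feel for how quickly this count grows, here is a minimal Python sketch that evaluates the Stirling number of the second kind with the standard inclusion-exclusion sum. The function name `stirling2` is an illustrative choice for this example, not a library function.

```python
from math import comb, factorial


def stirling2(n: int, k: int) -> int:
    """Count the ways to split n labeled cases into k non-empty clusters.

    Uses the inclusion-exclusion closed form:
        S(n, k) = (1 / k!) * sum_{i=0..k} (-1)^i * C(k, i) * (k - i)^n
    """
    if n < 0 or k < 0:
        raise ValueError("n and k must be non-negative")
    total = sum((-1) ** i * comb(k, i) * (k - i) ** n for i in range(k + 1))
    # The sum is always an exact multiple of k!, so integer division is safe.
    return total // factorial(k)


print(stirling2(4, 2))   # 7
print(stirling2(10, 3))  # 9330
```

Even 10 cases and 3 clusters already give 9330 candidate partitions, so checking every partition is not practical for datasets of realistic size.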

Partition-based or objective function-based clustering

Next, the basic version of the K-means algorithm is described, where the number K of clusters first has to be…
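As a rough illustration of what a basic K-means procedure can look like, here is a minimal sketch of the common Lloyd-style iteration in plain Python. It may differ from the version the article goes on to describe. The function name `kmeans`, the parameters `max_iter` and `seed`, the random choice of K cases as initial centroids, the Euclidean distance, and the handling of empty clusters are all assumptions made for this example.

```python
import random
from math import dist


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster a list of equal-length numeric tuples into k groups.

    Returns (labels, centroids), where labels[i] is the cluster index
    assigned to points[i].
    """
    rng = random.Random(seed)
    # Assumption: initialise centroids with k randomly chosen cases.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each case goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(data, k=2)
print(labels)
print(centroids)
```

On this toy dataset the two well-separated groups end up in different clusters. Which group gets index 0 depends on the random initialisation.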


Written by Hang Nguyen
