Machine learning basics (part 9): Clustering
Clustering analysis is the grouping of individuals, elements or cases in a population or dataset to discover structure in the data. Loosely speaking, we would like the cases within a group to be close or similar to one another, but dissimilar from cases in other groups.
Clustering is fundamentally a collection of methods for data exploration. These are unsupervised learning algorithms, but they can also be applied in a supervised way provided that class label information exists. When they are used in the usual unsupervised way, the resulting groups, if any emerge, are called clusters; the user names them and defines their "meaning". Clusters can then be employed as classes for further supervised machine learning. Clustering produces a partition of the dataset, that is, a division into mutually non-overlapping groups.
To compute clusters, we apply distance or similarity measures to decide which cases are assigned to each cluster and how the clusters are formed. Dividing N data cases into K non-empty clusters (groups) can be done in a huge number of ways, given by the Stirling number of the second kind:

$$S(N, K) = \frac{1}{K!} \sum_{j=0}^{K} (-1)^{j} \binom{K}{j} (K - j)^{N}$$
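To get a feel for how quickly the number of possible partitions grows, here is a minimal Python sketch that evaluates the Stirling number of the second kind with its standard inclusion-exclusion closed form. The function name `stirling2` and the sample values of N and K are my own choices for illustration; they do not come from any library or from the original text.

```python
from math import comb, factorial


def stirling2(n: int, k: int) -> int:
    """Count the ways to split n labelled cases into k non-empty clusters.

    Uses the closed form
        S(n, k) = (1 / k!) * sum_{j=0..k} (-1)^j * C(k, j) * (k - j)^n
    Python integers are arbitrary precision, so the result is exact.
    """
    if k < 0 or k > n:
        return 0
    total = sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1))
    # The alternating sum is always divisible by k!, so integer division is exact.
    return total // factorial(k)


# Small sanity check: 4 cases can be split into 2 clusters in 7 ways.
print(stirling2(4, 2))  # 7

# Illustrative sizes (assumed values): the counts explode even for modest N.
for n, k in [(10, 3), (25, 5), (100, 5)]:
    print(f"N={n}, K={k}: {stirling2(n, k)} possible partitions")
```

Even for a few dozen cases the count is far too large to enumerate, which is why practical clustering algorithms search for a good partition heuristically instead of trying them all.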
Partition-based or objective function-based clustering
Next, the basic version of the K-means algorithm is described, where the number K of clusters first has to be…