Introduction
K-means clustering is to find groups in the unlabeled data, with the number of groups represented by the variable $K$. The algorithm works iteratively to assign each data point to one of the $K$ groups based on the features that are provided and feature similarity.
Algorithm
Input: the number of clusters $K$, the data set (a collection of features for each data point).
The algorithm starts with initial estimates for the $K$ centroids, which can either be randomly generated or randomly selected from the data set. Then, it iterates between two steps:
- Data assignment
- Centroid update
Until a stopping criteria is met (i.e., no data points change clusters, or the maximum number of iterations is reached)
Step1 - Data assignment
Each centroid defines one of the cliusters. In this step, each data point is assigned to its nearest centroid, based on the squared Euclidean distance.
Step2 - Centroid update
In this step, the centroids are recomputed by taking the mean of all data points assigned to that centorid’s cluster.
$$
c_i = \frac{1}{|S_i|}\sum_{x_i\in S_i}x_i
$$
$S_i$: the set of data point assignments for each $i$th cluster centroid
$x_i$: a data point in $S_i$
Choosing K
To find the number of clusters in the data, we can run the K-means clustering algorithm for a range of K values and compare the results.