Introduction

K-means clustering is to find groups in the unlabeled data, with the number of groups represented by the variable $K$. The algorithm works iteratively to assign each data point to one of the $K$ groups based on the features that are provided and feature similarity.

Algorithm

Input: the number of clusters $K$, the data set (a collection of features for each data point).

The algorithm starts with initial estimates for the $K$ centroids, which can either be randomly generated or randomly selected from the data set. Then, it iterates between two steps:

Data assignment
Centroid update

Until a stopping criteria is met (i.e., no data points change clusters, or the maximum number of iterations is reached)

Step1 - Data assignment

Each centroid defines one of the cliusters. In this step, each data point is assigned to its nearest centroid, based on the squared Euclidean distance.

Step2 - Centroid update

In this step, the centroids are recomputed by taking the mean of all data points assigned to that centorid’s cluster.
$$
c_i = \frac{1}{|S_i|}\sum_{x_i\in S_i}x_i
$$

$S_i$: the set of data point assignments for each $i$th cluster centroid

$x_i$: a data point in $S_i$

Choosing K

To find the number of clusters in the data, we can run the K-means clustering algorithm for a range of K values and compare the results.

Reference

Introduction to K-means Clustering