What is clustering?
Clustering is the task of finding patterns in data points and grouping them into distinct clusters.
There are two broad approaches for clustering:
- Compactness: Points that lie close to each other and sit compactly around a cluster center fall in the same cluster. E.g.: K-Means clustering
- Connectivity: Points that are connected or immediately next to each other are put in the same cluster. E.g.: spectral clustering (contrasted with K-Means in the sketch below)
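As a quick illustration, here is a minimal sketch with scikit-learn comparing the two approaches on two concentric circles; the dataset and all parameter values are illustrative assumptions. K-Means groups by distance to the cluster centers and mixes the rings, while spectral clustering recovers them through connectivity.

```python
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans, SpectralClustering

# Two concentric circles: no compact, center-based clusters exist,
# but connectivity-based clusters do.
X, _ = make_circles(n_samples=500, factor=0.5, noise=0.05, random_state=0)

# Compactness-based: splits the plane in half, mixing the two rings.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Connectivity-based: recovers the two rings.
sc_labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                               n_neighbors=10, random_state=0).fit_predict(X)
```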
Spectral Clustering
Spectral clustering involves 3 steps:
- Compute a similarity graph
- Project the data onto a low-dimensional space
- Create clusters
Step 1 - Compute a similarity graph
We first create an undirected graph $G=(V,E)$ in one of the following ways:
The $\epsilon$-neighborhood graph: all points whose pairwise distances are smaller than $\epsilon$ are connected. The $\epsilon$-neighborhood graph is usually left unweighted, since the distances between connected points are all roughly of the same scale (at most $\epsilon$).
kNN graph: vertex $v_i$ is connected to $v_j$ if $v_j$ is among the k nearest neighbors of $v_i$.
However, the nearest-neighbor relationship is not symmetric: $v_j$ can be a nearest neighbor of $v_i$ without $v_i$ being a nearest neighbor of $v_j$. Thus, we end up with a directed graph.
To make it undirected, we can either simply ignore the directions of the edges, giving the k-nearest-neighbor graph, or connect two vertices only when each is among the k nearest neighbors of the other, giving the mutual k-nearest-neighbor graph.
In both cases, the edges are weighted by the similarity of the adjacent points.
Fully connected graph: connect all points with each other, and weight every edge by the similarity $s_{i,j}$ of its endpoints.
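As a concrete sketch, the first two constructions can be written with NumPy/SciPy as follows; the RBF similarity used for the edge weights and the bandwidth choice are assumptions, not fixed by the method:

```python
import numpy as np
from scipy.spatial.distance import cdist

def epsilon_graph(X, eps):
    """Unweighted epsilon-neighborhood graph: connect points with distance < eps."""
    D = cdist(X, X)                       # pairwise Euclidean distances
    A = (D < eps).astype(float)
    np.fill_diagonal(A, 0.0)              # no self-loops
    return A

def knn_graph(X, k, mutual=False):
    """kNN graph, made undirected either by ignoring edge directions
    (mutual=False) or by keeping only mutual neighbors (mutual=True).
    Edge weights use an RBF similarity (an assumed choice)."""
    D = cdist(X, X)
    np.fill_diagonal(D, np.inf)           # exclude each point from its own neighbors
    nn = np.argsort(D, axis=1)[:, :k]     # indices of the k nearest neighbors
    directed = np.zeros_like(D)
    directed[np.repeat(np.arange(len(X)), k), nn.ravel()] = 1.0
    # Symmetrize: union of edges, or intersection for the mutual kNN graph.
    A = np.minimum(directed, directed.T) if mutual else np.maximum(directed, directed.T)
    sigma = np.median(D[np.isfinite(D)])  # assumed RBF bandwidth
    W = np.exp(-(cdist(X, X) ** 2) / (2 * sigma ** 2))
    return A * W
```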
Step 2 - Project the data onto a low-dimensional space
Points in the same cluster may lie farther apart than points in different clusters. Our goal is therefore to transform the space so that whenever two points are close, they are always in the same cluster, and whenever they are far apart, they are in different clusters.
For this, we compute the graph Laplacian $L$:
$$
L = D - A,\quad \text{where}\ d_i = \sum_{j:(i,j)\in E} w_{ij}.\quad
\text{Thus,}\quad L_{ij} =
\begin{cases}
d_i & \text{if}\ i = j \\
-w_{ij} & \text{if}\ (i,j) \in E \\
0 & \text{if}\ (i,j) \notin E
\end{cases}
$$
$A$: the adjacency matrix, with entries $w_{ij}$
$D$: the degree matrix, i.e., the diagonal matrix with $D_{ii} = d_i$
Then we find the eigenvalues and eigenvectors of $L$. Since $L$ is symmetric and positive semi-definite, its eigenvalues are real and non-negative, and the eigenvectors belonging to the smallest eigenvalues encode the cluster structure.
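A minimal sketch of this step on a toy 4-node graph; the adjacency matrix below is an illustrative assumption:

```python
import numpy as np

# Toy weighted adjacency matrix: nodes {0, 1} and {2, 3} form two
# tightly connected pairs, joined only by weak cross edges.
A = np.array([[0.0, 1.0, 0.1, 0.0],
              [1.0, 0.0, 0.0, 0.1],
              [0.1, 0.0, 0.0, 1.0],
              [0.0, 0.1, 1.0, 0.0]])

D = np.diag(A.sum(axis=1))   # degree matrix
L = D - A                    # unnormalized graph Laplacian

# L is symmetric positive semi-definite, so eigh applies and returns
# eigenvalues in ascending order; the smallest is always 0, with the
# constant vector as its eigenvector.
eigvals, eigvecs = np.linalg.eigh(L)
```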
Step 3 - Create clusters
For 2 clusters:
We take $v_2$, the eigenvector corresponding to the second-smallest eigenvalue of $L$ (the Fiedler vector), assign each of its entries to the corresponding node, and then split the nodes: all nodes with value $> 0$ go into one cluster, while all the others go into the second cluster.
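Continuing the toy sketch from Step 2 (reusing `L` from that block), the split amounts to thresholding the Fiedler vector at zero; note that eigenvector signs are arbitrary, so the labels may come out flipped:

```python
import numpy as np

eigvals, eigvecs = np.linalg.eigh(L)
v2 = eigvecs[:, 1]             # eigenvector of the second-smallest eigenvalue
labels = (v2 > 0).astype(int)  # entries > 0 -> one cluster, the rest -> the other
# For the toy A above this separates nodes {0, 1} from {2, 3}
# (possibly with the labels swapped, since eigenvector sign is arbitrary).
```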
For k clusters:
We first compute the normalized graph Laplacian:
$$
L_{\text{norm}} = D^{-1/2} L D^{-1/2}
$$
- Compute the first k eigenvectors $\{v_1, v_2, \dots, v_k\}$ of $L_{\text{norm}}$, i.e., those belonging to the k smallest eigenvalues
- Stack these k eigenvectors as columns to form a matrix $V \in \mathbb{R}^{n \times k}$
- Represent every node by the corresponding row of $V$. These rows form the feature vectors of the nodes.
- Use K-Means clustering to group these feature vectors into k clusters $\{C_1, C_2, \dots, C_k\}$, as in the end-to-end sketch below
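Putting the steps together, here is a minimal end-to-end sketch of the recipe above; it assumes every node has positive degree, and omits the extra row normalization that some variants apply before K-Means:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(A, k):
    """Normalized spectral clustering of a weighted adjacency matrix A
    into k clusters. Assumes every node has positive degree."""
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.diag(d) - A
    L_norm = D_inv_sqrt @ L @ D_inv_sqrt   # D^{-1/2} L D^{-1/2}

    # Eigenvectors of the k smallest eigenvalues, stacked as columns;
    # row i of V is the k-dimensional feature vector of node i.
    _, eigvecs = np.linalg.eigh(L_norm)
    V = eigvecs[:, :k]

    # K-Means on the row embeddings yields the final clusters.
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(V)
```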