Graph cut point of view

The intuition of clustering is to separate points into different groups according to their similarities. For data given in the form of a similarity graph, this problem can be restated as follows: we want to find a partition of the graph such that the edges between different groups have very low weight (which means that points in different clusters are dissimilar from each other) and the edges within a group have high weight (which means that points within the same cluster are similar to each other).

Given a similarity graph with adjacency matrix $W$, the simplest and most direct way to construct a partition of the graph is to solve the mincut problem. To define it, recall the notation $W(A, B) := \sum_{i \in A, j \in B} w_{ij}$ and $\bar{A}$ for the complement of $A$. For a given number $k$ of subsets, the mincut approach simply consists in choosing a partition $A_1, \dots, A_k$ which minimizes

$$\operatorname{cut}(A_1, \dots, A_k) := \frac{1}{2} \sum_{i=1}^{k} W(A_i, \bar{A_i}).$$

In particular for $k = 2$, mincut is a relatively easy problem and can be solved efficiently. However, in practice it often does not lead to satisfactory partitions. The problem is that in many cases, the solution of mincut simply separates one individual vertex from the rest of the graph. One way to circumvent this is to explicitly request that the sets $A_1, \dots, A_k$ are "reasonably large". The two most common objective functions to encode this are RatioCut (Hagen and Kahng, 1992) and the normalized cut Ncut (Shi and Malik, 2000). In RatioCut, the size of a subset $A$ of the graph is measured by its number of vertices $|A|$, while in Ncut the size is measured by the weights of its edges, $\operatorname{vol}(A) := \sum_{i \in A} d_i$. The definitions are:

$$\operatorname{RatioCut}(A_1, \dots, A_k) := \frac{1}{2} \sum_{i=1}^{k} \frac{W(A_i, \bar{A_i})}{|A_i|} = \sum_{i=1}^{k} \frac{\operatorname{cut}(A_i, \bar{A_i})}{|A_i|}$$

$$\operatorname{Ncut}(A_1, \dots, A_k) := \frac{1}{2} \sum_{i=1}^{k} \frac{W(A_i, \bar{A_i})}{\operatorname{vol}(A_i)} = \sum_{i=1}^{k} \frac{\operatorname{cut}(A_i, \bar{A_i})}{\operatorname{vol}(A_i)}.$$
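To make these objectives concrete, here is a minimal Python sketch (an illustration added for this exposition, not code from the original text) that evaluates cut, RatioCut, and Ncut on a small made-up weighted graph:

```python
import numpy as np

def cut(W, A, B):
    """Sum of edge weights between vertex sets A and B: W(A, B)."""
    return W[np.ix_(A, B)].sum()

def ratio_cut(W, parts):
    """RatioCut: sum over parts of cut(A_i, complement) / |A_i|."""
    n = W.shape[0]
    total = 0.0
    for A in parts:
        comp = [v for v in range(n) if v not in A]
        total += cut(W, A, comp) / len(A)
    return total

def ncut(W, parts):
    """Ncut: sum over parts of cut(A_i, complement) / vol(A_i)."""
    n = W.shape[0]
    deg = W.sum(axis=1)
    total = 0.0
    for A in parts:
        comp = [v for v in range(n) if v not in A]
        total += cut(W, A, comp) / deg[A].sum()
    return total

# Toy graph: two dense triangles joined by one weak edge of weight 0.1.
W = np.array([[0, 1, 1, 0,   0, 0],
              [1, 0, 1, 0,   0, 0],
              [1, 1, 0, 0.1, 0, 0],
              [0, 0, 0.1, 0, 1, 1],
              [0, 0, 0,   1, 0, 1],
              [0, 0, 0,   1, 1, 0]], dtype=float)
parts = [[0, 1, 2], [3, 4, 5]]
print(ratio_cut(W, parts), ncut(W, parts))
```

The division by $|A_i|$ or $\operatorname{vol}(A_i)$ is what penalizes splitting off a single vertex: for this graph, the partition $\{0\}, \{1, \dots, 5\}$ gives a much larger RatioCut value than the balanced two-triangle split.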

So what both objective functions try to achieve is that the clusters are "balanced", as measured by the number of vertices or the edge weights, respectively. Unfortunately, introducing these balancing conditions makes the previously simple mincut problem NP hard. Spectral clustering is a way to solve relaxed versions of those problems. We will see that relaxing Ncut leads to normalized spectral clustering, while relaxing RatioCut leads to unnormalized spectral clustering.

  1. Approximating RatioCut for k = 2

    Let us start with the case of RatioCut and $k = 2$, because the relaxation is easiest to understand in this setting. Our goal is to solve the optimization problem

    $$\min_{A \subset V} \operatorname{RatioCut}(A, \bar{A}). \qquad (1)$$

    We first rewrite the problem in a more convenient form. Given a subset $A \subset V$, we define the vector $f = (f_1, \dots, f_n)' \in \mathbb{R}^n$ with entries

    $$f_i = \begin{cases} \sqrt{|\bar{A}|/|A|} & \text{if } v_i \in A \\ -\sqrt{|A|/|\bar{A}|} & \text{if } v_i \in \bar{A}. \end{cases}$$

    Now the RatioCut objective function can be conveniently rewritten using the unnormalized graph Laplacian. This is due to the following calculation:

    $$f'Lf = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2 = \operatorname{cut}(A, \bar{A}) \left( \frac{|\bar{A}|}{|A|} + \frac{|A|}{|\bar{A}|} + 2 \right) = |V| \cdot \operatorname{RatioCut}(A, \bar{A}).$$

    Additionally, one can check that $\sum_i f_i = 0$, that is $f \perp \mathbb{1}$, and that $\|f\|^2 = n$. The problem of minimizing (1) can thus be equivalently rewritten as

    $$\min_{A \subset V} f'Lf \ \text{ subject to } \ f \perp \mathbb{1}, \ f_i \text{ defined as above}, \ \|f\| = \sqrt{n}.$$

    This is a discrete optimization problem, as the entries of the solution vector are only allowed to take two particular values, and of course it is still NP hard. The most obvious relaxation in this setting is to discard the discreteness condition and instead allow $f_i$ to take arbitrary values in $\mathbb{R}$. This leads to the relaxed optimization problem

    $$\min_{f \in \mathbb{R}^n} f'Lf \ \text{ subject to } \ f \perp \mathbb{1}, \ \|f\| = \sqrt{n}.$$

    By the Rayleigh-Ritz theorem (e.g., see Section 5.5.2 of Lütkepohl, 1997) it can be seen immediately that the solution of this problem is given by the vector $f$ which is the eigenvector corresponding to the second smallest eigenvalue of $L$ (recall that the smallest eigenvalue of $L$ is 0 with eigenvector $\mathbb{1}$). So we can approximate a minimizer of RatioCut by the second eigenvector of $L$. However, in order to obtain a partition of the graph we need to re-transform the real-valued solution vector $f$ of the relaxed problem into a discrete indicator vector. The simplest way to do this is to use the sign of $f$ as indicator function, that is to choose $v_i \in A$ if $f_i \geq 0$ and $v_i \in \bar{A}$ if $f_i < 0$. In practice, most spectral clustering algorithms instead consider the entries $f_i$ as points in $\mathbb{R}$ and cluster them into two groups by the $k$-means algorithm.
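    The following numpy sketch (a hypothetical illustration, not pseudocode from the text) carries out exactly this relaxation for $k = 2$: build the unnormalized Laplacian, take the eigenvector of the second smallest eigenvalue, and split by sign:

    ```python
    import numpy as np

    def ratiocut_bipartition(W):
        """Relaxed RatioCut for k = 2: split vertices by the sign of the
        second eigenvector (Fiedler vector) of the unnormalized Laplacian."""
        D = np.diag(W.sum(axis=1))
        L = D - W                            # unnormalized graph Laplacian
        _, eigvecs = np.linalg.eigh(L)       # eigenvalues in ascending order
        f = eigvecs[:, 1]                    # eigenvector of 2nd smallest eigenvalue
        return f >= 0                        # boolean cluster indicator

    # Same toy graph as above: the weak 0.1 edge is the one that gets cut.
    # (The eigenvector sign is arbitrary, so the two labels may be swapped.)
    W = np.array([[0, 1, 1, 0,   0, 0],
                  [1, 0, 1, 0,   0, 0],
                  [1, 1, 0, 0.1, 0, 0],
                  [0, 0, 0.1, 0, 1, 1],
                  [0, 0, 0,   1, 0, 1],
                  [0, 0, 0,   1, 1, 0]], dtype=float)
    print(ratiocut_bipartition(W))
    ```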

  2. Approximating RatioCut for arbitrary k

    The relaxation of the RatioCut minimization problem in the case of a general value $k$ follows a similar principle as the one above. Given a partition of $V$ into $k$ sets $A_1, \dots, A_k$, we define $k$ indicator vectors $h_j = (h_{1,j}, \dots, h_{n,j})'$ by

    $$h_{i,j} = \begin{cases} 1/\sqrt{|A_j|} & \text{if } v_i \in A_j \\ 0 & \text{otherwise} \end{cases} \qquad (i = 1, \dots, n; \ j = 1, \dots, k).$$

    Then we set the matrix $H \in \mathbb{R}^{n \times k}$ as the matrix containing those $k$ indicator vectors as columns. Observe that the columns in $H$ are orthonormal to each other, that is $H'H = I$. Similar to the calculations in the last section we can see that

    $$h_i' L h_i = \frac{\operatorname{cut}(A_i, \bar{A_i})}{|A_i|} = (H'LH)_{ii},$$

    and hence

    $$\operatorname{RatioCut}(A_1, \dots, A_k) = \sum_{i=1}^{k} h_i' L h_i = \operatorname{Tr}(H'LH),$$

    where $\operatorname{Tr}$ denotes the trace of a matrix. So the problem of minimizing RatioCut$(A_1, \dots, A_k)$ can be rewritten as

    $$\min_{A_1, \dots, A_k} \operatorname{Tr}(H'LH) \ \text{ subject to } \ H'H = I, \ H \text{ defined as above}.$$

    Similar to above we now relax the problem by allowing the entries of the matrix $H$ to take arbitrary real values. Then the relaxed problem becomes

    $$\min_{H \in \mathbb{R}^{n \times k}} \operatorname{Tr}(H'LH) \ \text{ subject to } \ H'H = I.$$

    This is the standard form of a trace minimization problem, and again a version of the Rayleigh-Ritz theorem (e.g., see Section 5.2.2.(6) of Lütkepohl, 1997) tells us that the solution is given by choosing $H$ as the matrix which contains the first $k$ eigenvectors of $L$ as columns. We can see that the matrix $H$ is in fact the matrix $U$ used in the unnormalized spectral clustering algorithm as described in Section 4. Again we need to re-convert the real-valued solution matrix to a discrete partition. As above, the standard way is to use the $k$-means algorithm on the rows of $U$. This leads to the general unnormalized spectral clustering algorithm as presented in Section 4.
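    A minimal sketch of the resulting procedure (the function name is ours, and scipy is assumed to be available): compute the first $k$ eigenvectors of $L$ and run k-means on the rows of the embedding matrix $U$:

    ```python
    import numpy as np
    from scipy.cluster.vq import kmeans2

    def unnormalized_spectral_clustering(W, k):
        """Relaxed RatioCut for general k: embed the vertices via the first
        k eigenvectors of L = D - W, then discretize with k-means on rows."""
        D = np.diag(W.sum(axis=1))
        L = D - W                            # unnormalized graph Laplacian
        _, eigvecs = np.linalg.eigh(L)       # eigenvalues in ascending order
        U = eigvecs[:, :k]                   # relaxed indicator matrix H
        _, labels = kmeans2(U, k, minit='points')
        return labels

    # Usage with the toy graph from above: unnormalized_spectral_clustering(W, 2)
    ```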

  3. Approximating Ncut

    Techniques similar to the ones above can be used to derive normalized spectral clustering as a relaxation of minimizing Ncut. Given a partition of $V$ into $k$ sets $A_1, \dots, A_k$, we now define the indicator vectors $h_j$ by

    $$h_{i,j} = \begin{cases} 1/\sqrt{\operatorname{vol}(A_j)} & \text{if } v_i \in A_j \\ 0 & \text{otherwise.} \end{cases}$$

    Then $H'DH = I$ and $h_i' L h_i = \operatorname{cut}(A_i, \bar{A_i}) / \operatorname{vol}(A_i)$, so the problem of minimizing Ncut can be written as

    $$\min_{A_1, \dots, A_k} \operatorname{Tr}(H'LH) \ \text{ subject to } \ H'DH = I.$$

    Relaxing the discreteness condition and substituting $T = D^{1/2}H$, we obtain the relaxed problem

    $$\min_{T \in \mathbb{R}^{n \times k}} \operatorname{Tr}(T' D^{-1/2} L D^{-1/2} T) \ \text{ subject to } \ T'T = I.$$

    Again this is the standard trace minimization problem, which is solved by the matrix $T$ containing the first $k$ eigenvectors of $L_{\mathrm{sym}}$ as columns. Re-substituting $H = D^{-1/2}T$ and using Proposition 3 we see that the solution $H$ consists of the first $k$ eigenvectors of the matrix $L_{\mathrm{rw}}$, or the first $k$ generalized eigenvectors of $Lu = \lambda D u$. This yields the normalized spectral clustering algorithm according to Shi and Malik (2000).
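    A corresponding sketch for the Shi-Malik version (again an illustration under our own naming, and assuming no isolated vertices so that $D$ is positive definite): scipy's generalized symmetric eigensolver handles $Lu = \lambda Du$ directly:

    ```python
    import numpy as np
    from scipy.linalg import eigh
    from scipy.cluster.vq import kmeans2

    def normalized_spectral_clustering(W, k):
        """Relaxed Ncut (Shi and Malik): first k generalized eigenvectors
        of L u = lambda D u, then k-means on the rows of the embedding."""
        D = np.diag(W.sum(axis=1))           # assumes every degree is > 0
        L = D - W
        # eigh(L, D) solves the generalized symmetric problem L u = lambda D u;
        # eigenvalues are returned in ascending order
        _, eigvecs = eigh(L, D)
        U = eigvecs[:, :k]
        _, labels = kmeans2(U, k, minit='points')
        return labels
    ```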

  4. Comments on the relaxation approach

    There are several comments we should make about this derivation of spectral clustering. Most importantly, there is no guarantee whatsoever on the quality of the solution of the relaxed problem compared to the exact solution.

    That is, if $A_1, \dots, A_k$ is the exact solution of minimizing RatioCut and $B_1, \dots, B_k$ is the solution constructed by unnormalized spectral clustering, then $\operatorname{RatioCut}(B_1, \dots, B_k) - \operatorname{RatioCut}(A_1, \dots, A_k)$ can be arbitrarily large.

    In general it is known that efficient algorithms to approximate balanced graph cuts up to a constant factor do not exist. On the contrary, this approximation problem can be NP hard itself.

    Of course, the relaxation we discussed above is not unique. For example, a completely different relaxation which leads to a semi-definite program is derived in Bie and Cristianini (2006), and there might be many other useful relaxations. The reason why the spectral relaxation is so appealing is not that it leads to particularly good solutions. Its popularity is mainly due to the fact that it results in a standard linear algebra problem which is simple to solve.

This toy data set consists of a random sample of 200 points $x_1, \dots, x_{200} \in \mathbb{R}$ drawn according to a mixture of four Gaussians.