Practical details
Constructing the similarity graph
The similarity function itself
In the common case where the data points live in the Euclidean space R^d, a reasonable default candidate is the Gaussian similarity function s(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2)) (but of course we need to choose the parameter σ here). Ultimately, the choice of the similarity function depends on the domain the data comes from, and no general advice can be given.
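As a concrete illustration, here is a minimal sketch of this similarity function, assuming NumPy; the function name and the default σ = 1 are placeholders chosen for the example, not part of the original text:

```python
import numpy as np

def gaussian_similarity(x_i, x_j, sigma=1.0):
    """Gaussian similarity s(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dist = np.sum((np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)) ** 2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# The similarity decays with distance; sigma sets the length scale.
print(gaussian_similarity([0.0, 0.0], [1.0, 0.0]))  # ~0.61
print(gaussian_similarity([0.0, 0.0], [3.0, 0.0]))  # ~0.01
```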
Which type of similarity graph

- The ε-neighborhood graph: The points on the middle moon are already very tightly connected, while the points in the Gaussian are barely connected. This problem always occurs if we have data "on different scales", that is, the distances between data points are different in different regions of the space.
- The k-nearest neighbor graph: Points in the low-density Gaussian are connected with points in the high-density moon; on the other hand, the k-nearest neighbor graph can break into several disconnected components if there are high-density regions which are reasonably far away from each other.
- The mutual k-nearest neighbor graph: It tends to connect points within regions of constant density, but does not connect regions of different densities with each other. Hence, the mutual k-nearest neighbor graph seems particularly well-suited if we want to detect clusters of different densities.
- The fully connected graph: The fully connected graph is very often used in connection with the Gaussian similarity function s(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2)). Points in local neighborhoods are connected with relatively high weights, while edges between far-away points have positive, but negligible weights. However, the resulting similarity matrix is not a sparse matrix.
As a general recommendation we suggest to work with the k-nearest neighbor graph as the first choice. It is simple to work with, results in a sparse adjacency matrix W, and in our experience is less vulnerable to unsuitable choices of parameters than the other graphs.
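Following this recommendation, here is a minimal sketch of building such a graph, assuming NumPy, SciPy and scikit-learn are available; the function name, the toy data and the parameter values k = 10 and σ = 0.5 are illustrative assumptions, not prescriptions:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_similarity_graph(X, k=10, sigma=1.0):
    """Sparse, symmetric k-nearest-neighbor graph with Gaussian edge weights."""
    # Sparse matrix of distances to the k nearest neighbors of every point.
    D = kneighbors_graph(X, n_neighbors=k, mode="distance", include_self=False)
    # Turn each stored distance d into the similarity exp(-d^2 / (2 sigma^2)).
    W = D.copy()
    W.data = np.exp(-(D.data ** 2) / (2.0 * sigma ** 2))
    # Symmetrize: connect i and j if either point is among the other's k nearest neighbors.
    return W.maximum(W.T)

X = np.random.RandomState(0).randn(200, 2)      # toy data
W = knn_similarity_graph(X, k=10, sigma=0.5)
print(W.shape, W.nnz)                           # sparse adjacency matrix W
```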
The parameters of the similarity graph
In general, if the similarity graph contains more connected components than the number of clusters we ask the algorithm to detect, then spectral clustering will trivially return connected components as clusters. Unless one is perfectly sure that those connected components are the correct clusters, one should make sure that the similarity graph is connected, or only consists of "few" connected components and very few or no isolated vertices. There are many theoretical results on how connectivity of random graphs can be achieved, but all those results only hold in the limit as the sample size n tends to infinity. For example, it is known that for n data points drawn i.i.d. from some underlying density with a connected support in R^d,

- the k-nearest neighbor graph and the mutual k-nearest neighbor graph will be connected if we choose k on the order of log(n) (e.g., Brito, Chavez, Quiroz, and Yukich, 1997);
- similar arguments show that the parameter ε in the ε-neighborhood graph has to be chosen on the order of (log(n)/n)^(1/d) to guarantee connectivity in the limit (Penrose, 1999).
While of theoretical interest, all those results do not really help us for choosing k and ε on a finite sample.
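In practice, rather than relying on asymptotic results, one can simply check how many connected components the constructed graph actually has, for example with SciPy's csgraph module; in this sketch the toy data and the value k = 10 are only placeholders:

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import kneighbors_graph

X = np.random.RandomState(0).randn(200, 2)       # toy data
W = kneighbors_graph(X, n_neighbors=10, mode="connectivity", include_self=False)

# Treat the graph as undirected and count its connected components.
n_components, labels = connected_components(W, directed=False)
print("connected components:", n_components)
# If this number exceeds the number of clusters we are looking for,
# increase k (or epsilon / sigma) and rebuild the graph.
```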
- The k-nearest neighbor graph: The connectivity parameter k should be chosen such that the resulting graph is connected, or at least has significantly fewer connected components than the number of clusters we want to detect. For small or medium-sized graphs this can be tried out "by foot". For very large graphs, a first approximation could be to choose k in the order of log(n), as suggested by the asymptotic connectivity results.
- The mutual k-nearest neighbor graph: The advantage of the mutual k-nearest neighbor graph compared to the standard k-nearest neighbor graph is that it tends not to connect areas of different density. While this can be good if there are clear clusters induced by separate high-density areas, it can hurt in less obvious situations, as disconnected parts of the graph will always be chosen as clusters by spectral clustering. Very generally, one can observe that the mutual k-nearest neighbor graph has far fewer edges than the standard k-nearest neighbor graph for the same parameter k. This suggests choosing k significantly larger for the mutual k-nearest neighbor graph than one would for the standard k-nearest neighbor graph. However, to take advantage of the property that the mutual k-nearest neighbor graph does not connect regions of different density, it would be necessary to allow for several "meaningful" disconnected parts of the graph. Unfortunately, we do not know of any general heuristic to choose the parameter k such that this can be achieved.
- The ε-neighborhood graph: To determine the smallest value of ε for which the graph is connected is very simple: one has to choose ε as the length of the longest edge in a minimal spanning tree of the fully connected graph on the data points. The latter can be determined easily by any minimal spanning tree algorithm. However, note that when the data contains outliers, this heuristic will choose ε so large that even the outliers are connected to the rest of the data. A similar effect happens when the data contains several tight clusters which are very far apart from each other. In both cases, ε will be chosen too large to reflect the scale of the most important part of the data.
- The fully connected graph: If one uses a fully connected graph together with a similarity function which can be scaled itself, for example the Gaussian similarity function, then the scale of the similarity function should be chosen such that the resulting graph has similar properties as a corresponding k-nearest neighbor or ε-neighborhood graph would have. In particular, for the Gaussian similarity function several rules of thumb are frequently used. For example, one can choose σ in the order of the mean distance of a point to its k-th nearest neighbor, where k is chosen similarly as above (e.g., k in the order of log(n)). Another way is to determine ε by the minimal spanning tree heuristic described above, and then choose σ = ε. But note that all those rules of thumb are very ad hoc, and depending on the given data at hand and its distribution of inter-point distances they might not work at all. A small sketch of both heuristics is given after this list.
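A minimal sketch of the two heuristics just mentioned (the MST-based choice of ε and the k-th-nearest-neighbor rule of thumb for σ), assuming SciPy and scikit-learn; the toy data and variable names are placeholders:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform
from sklearn.neighbors import NearestNeighbors

X = np.random.RandomState(0).randn(200, 2)        # toy data
n = len(X)

# epsilon heuristic: longest edge of a minimal spanning tree of the fully connected graph.
D = squareform(pdist(X))                           # dense pairwise distance matrix
epsilon = minimum_spanning_tree(D).data.max()

# sigma heuristic: mean distance of a point to its k-th nearest neighbor, with k ~ log(n).
k = max(1, int(round(np.log(n))))
dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)  # +1: each point is its own neighbor
sigma = dist[:, -1].mean()

print(f"epsilon (MST heuristic): {epsilon:.3f}   sigma (k-NN heuristic): {sigma:.3f}")
```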
In general, experience shows that spectral clustering can be quite sensitive to changes in the similarity graph and to the choice of its parameters. Unfortunately, to our knowledge there has been no systematic study which investigates the effects of the similarity graph and its parameters on clustering and comes up with well-justified rules of thumb. None of the recommendations above is based on firm theoretical ground. Finding rules which have a theoretical justification should be considered an interesting and important topic for future research.
Computing the eigenvectors
To implement spectral clustering in practice one has to compute the first k eigenvectors of a potentially large graph Laplacian matrix. Luckily, if we use the k-nearest neighbor graph or the ε-neighborhood graph, then all those matrices are sparse. Efficient methods exist to compute the first eigenvectors of sparse matrices, the most popular ones being the power method or Krylov subspace methods such as the Lanczos method (Golub and Van Loan, 1996). The speed of convergence of those algorithms depends on the size of the eigengap (also called spectral gap) |λ_k − λ_{k+1}|. The larger this eigengap is, the faster the algorithms computing the first k eigenvectors converge.

Note that a general problem occurs if one of the eigenvalues under consideration has multiplicity larger than one. For example, in the ideal situation of k disconnected clusters, the eigenvalue 0 has multiplicity k. As we have seen, in this case the eigenspace is spanned by the k cluster indicator vectors. But unfortunately, the vectors computed by the numerical eigensolvers do not necessarily converge to those particular vectors. Instead they just converge to some orthonormal basis of the eigenspace, and it usually depends on implementation details to which basis exactly the algorithm converges. But this is not so bad after all. Note that all vectors in the space spanned by the cluster indicator vectors have the form u = Σ_i a_i 1_{A_i} for some coefficients a_i (a linear combination of the indicator vectors 1_{A_i} of the clusters A_i), that is, they are piecewise constant on the clusters. So the vectors returned by the eigensolvers still encode the information about the clusters, which can then be used by the k-means algorithm to reconstruct the clusters.
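A minimal sketch of this step, assuming SciPy's sparse symmetric eigensolver `eigsh` (an implicitly restarted Lanczos method); the use of the unnormalized Laplacian, the toy data and k = 3 are illustrative choices only:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh
from sklearn.neighbors import kneighbors_graph

X = np.random.RandomState(0).randn(300, 2)                 # toy data
W = kneighbors_graph(X, n_neighbors=10, mode="connectivity", include_self=False)
W = W.maximum(W.T)                                          # symmetric sparse adjacency matrix

d = np.asarray(W.sum(axis=1)).ravel()                       # vertex degrees
L = sp.diags(d) - W                                         # unnormalized graph Laplacian L = D - W

k = 3
# Smallest k eigenvalues/eigenvectors of the sparse symmetric Laplacian.
# (For large problems, shift-invert mode with sigma close to 0 converges faster.)
vals, vecs = eigsh(L, k=k, which="SM")
print(vals)          # eigenvalues near 0; vecs has shape (n, k) and feeds the k-means step
```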
The number of clusters
Choosing the number k of clusters is a general problem for all clustering algorithms, and a variety of more or less successful methods have been devised for this problem. One tool which is particularly designed for spectral clustering is the eigengap heuristic, which can be used for all three graph Laplacians. Here the goal is to choose the number k such that all eigenvalues λ_1, ..., λ_k are very small, but λ_{k+1} is relatively large. There are several justifications for this procedure. The first one is based on perturbation theory, where we observe that in the ideal case of k completely disconnected clusters, the eigenvalue 0 has multiplicity k, and then there is a gap to the (k+1)-th eigenvalue λ_{k+1} > 0. Other explanations can be given by spectral graph theory. Here, many geometric invariants of the graph can be expressed or bounded with the help of the first eigenvalues of the graph Laplacian. In particular, the sizes of cuts are closely related to the size of the first eigenvalues. So we can see that the heuristic works well if the clusters in the data are very well pronounced. However, the more noisy or overlapping the clusters are, the less effective this heuristic is.
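A minimal sketch of this heuristic, assuming the first few Laplacian eigenvalues have already been computed (for example with the eigensolver sketch above); the function name and the example eigenvalues are placeholders:

```python
import numpy as np

def eigengap_heuristic(eigenvalues, k_max=10):
    """Pick k at the largest gap among the first k_max (sorted) eigenvalues."""
    vals = np.sort(np.asarray(eigenvalues))[:k_max]
    gaps = np.diff(vals)                 # gaps[i] = lambda_{i+2} - lambda_{i+1}
    return int(np.argmax(gaps)) + 1      # estimated number of clusters k

# Three eigenvalues close to zero, then a clear jump -> k = 3.
print(eigengap_heuristic([0.0, 0.01, 0.02, 0.9, 1.1, 1.3]))   # 3
```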
The k-means step
// To revisit later
Which graph Laplacian should be used?