Similarity graphs

Graph notation

Let $G = (V, E)$ be an undirected graph with vertex set $V =\{v_1, \dots , v_n\}$ . In the following we assume that the graph $G$ is weighted, that is, each edge between two vertices $v_i$ and $v_j$ carries a non-negative weight $w_{ij} \geq 0$ . The weighted adjacency matrix of the graph is the matrix $W = (w_{ij})_{i,j=1,\dots,n}$ . If $w_{ij} = 0$ this means that the vertices $v_i$ and $v_j$ are not connected by an edge. As $G$ is undirected we require $w_{ij} = w_{ji}$ . The degree of a vertex $v_i \in V$ is defined as
$d_i = \sum_{j=1}^n w_{ij}$ .

Note that, in fact, this sum only runs over all vertices adjacent to $v_i$ , as for all other vertices $v_j$ the weight $w_{ij}$ is 0. The degree matrix $D$ is defined as the diagonal matrix with the degrees $d_1, \dots, d_n$ on the diagonal. Given a subset of vertices $A \subset V$ , we denote its complement $V \setminus A$ by $\overline{A}$ . We define the indicator vector $\Bbb I_A = (f_1, \dots , f_n)' \in \Bbb R^n$ as the vector with entries $f_i = 1$ if $v_i \in A$ and $f_i = 0$ otherwise. For convenience we introduce the shorthand notation $i \in A$ for the set of indices $\{i \mid v_i \in A\}$ , in particular when dealing with a sum like $\sum_{i\in A} w_{ij}$ . For two not necessarily disjoint sets $A, B \subset V$ we define

$W(A, B) := \sum_{i\in A, j \in B} w_{ij}$ .

We consider two different ways of measuring the "size" of a subset $A\subset V$ :
$|A| := \text{the number of vertices in } A$ ,
$\operatorname{vol}(A) := \sum_{i\in A}d_i$ .
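The quantities defined so far can be sketched in a few lines of plain Python. This is only an illustration: the matrix values, the 0-based index sets, and the helper names `degrees`, `degree_matrix`, `weight_between`, and `vol` are assumptions made for the example and do not come from the text.

```python
# Toy symmetric weighted adjacency matrix (illustrative values only).
W = [
    [0.0, 0.8, 0.6, 0.0],
    [0.8, 0.0, 0.9, 0.0],
    [0.6, 0.9, 0.0, 0.1],
    [0.0, 0.0, 0.1, 0.0],
]
n = len(W)


def degrees(W):
    """Return the list of degrees d_i = sum_j w_ij."""
    return [sum(row) for row in W]


def degree_matrix(W):
    """Return the diagonal matrix D with d_1, ..., d_n on the diagonal."""
    d = degrees(W)
    size = len(W)
    return [[d[i] if i == j else 0.0 for j in range(size)] for i in range(size)]


def weight_between(W, A, B):
    """Return W(A, B), the sum of w_ij over i in A and j in B."""
    return sum(W[i][j] for i in A for j in B)


def vol(W, A):
    """Return vol(A), the sum of the degrees of the vertices in A."""
    d = degrees(W)
    return sum(d[i] for i in A)


# A subset of vertices (as 0-based indices) and its complement.
A = {0, 1, 2}
A_bar = set(range(n)) - A

print("degrees:", degrees(W))
print("|A| =", len(A))
print("vol(A) =", vol(W, A))
print("W(A, complement of A) =", weight_between(W, A, A_bar))
```

Since $d_i$ sums over all vertices, $\operatorname{vol}(A)$ equals $W(A, V)$, which is a convenient sanity check for helpers like these.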

Intuitively, $|A|$ measures the size of $A$ by its number of vertices, while $\operatorname{vol}(A)$ measures the size of $A$ by summing over the weights of all edges attached to vertices in $A$ . A subset $A \subset V$ of a graph is connected if any two vertices in $A$ can be joined by a path such that all intermediate points also lie in $A$ . A subset $A$ is called a connected component if it is connected and if there are no connections between vertices in $A$ and $\overline{A}$ . The nonempty sets $A_1, \dots, A_k$ form a partition of the graph if $A_i \cap A_j = \emptyset$ for $i \neq j$ and $A_1 \cup \dots \cup A_k = V$ .
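To make the notion of a connected component concrete, the following sketch finds the components of a weighted graph by breadth-first search, treating $w_{ij} > 0$ as "connected by an edge". The function name `connected_components` and the example matrix are assumptions chosen for illustration.

```python
from collections import deque


def connected_components(W):
    """Return the connected components of the graph with adjacency matrix W.

    Each component is a sorted list of 0-based vertex indices. Two vertices
    i and j count as adjacent whenever W[i][j] > 0.
    """
    n = len(W)
    seen = [False] * n
    components = []
    for start in range(n):
        if seen[start]:
            continue
        # Breadth-first search from an unvisited vertex collects one component.
        seen[start] = True
        queue = deque([start])
        component = []
        while queue:
            i = queue.popleft()
            component.append(i)
            for j in range(n):
                if W[i][j] > 0 and not seen[j]:
                    seen[j] = True
                    queue.append(j)
        components.append(sorted(component))
    return components


# Illustrative graph with two components: {0, 1} and {2, 3}.
W_example = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 2.0],
    [0.0, 0.0, 2.0, 0.0],
]
print(connected_components(W_example))
```

The components returned this way are nonempty, pairwise disjoint, and cover all vertices, so they form a partition of the graph in the sense defined above.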
Different similarity graphs

There are several popular constructions to transform a given set $x_1, \dots, x_n$ of data points with pairwise similarities $s_{ij}$ or pairwise distances $d_{ij}$ into a graph. When constructing similarity graphs the goal is to model the local neighborhood relationships between the data points.

The $\epsilon$ -neighborhood graph: Here we connect all points whose pairwise distances are smaller than $\epsilon$ . As the distances between all connected points are roughly of the same scale (at most $\epsilon$ ), weighting the edges would not incorporate more information about the data to the graph. Hence, the $\epsilon$ -neighborhood graph is usually considered as an unweighted graph.
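A minimal sketch of this construction, assuming the data points are coordinate tuples and the distance is Euclidean (the text does not fix a particular distance). The function name `epsilon_graph` is hypothetical.

```python
import math


def epsilon_graph(points, eps):
    """Return the unweighted adjacency matrix of the epsilon-neighborhood graph.

    Entry (i, j) is 1 if the Euclidean distance between points i and j is
    smaller than eps, and 0 otherwise. No self-loops are created.
    """
    n = len(points)
    W = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) < eps:
                W[i][j] = W[j][i] = 1
    return W


# Illustrative data: two nearby points and one far away.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
W_eps = epsilon_graph(points, eps=1.5)
print(W_eps)
```

Because distances are symmetric, the resulting matrix is symmetric by construction, so the graph is undirected without any extra step.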

$k$ -nearest neighbor graphs: Here the goal is to connect vertex $v_i$ with vertex $v_j$ if $v_j$ is among the $k$ -nearest neighbors of $v_i$ . However, this definition leads to a directed graph, as the neighborhood relationship is not symmetric. There are two ways of making this graph undirected.
- The first way is to simply ignore the directions of the edges, that is, we connect $v_i$ and $v_j$ with an undirected edge if $v_i$ is among the $k$ -nearest neighbors of $v_j$ or if $v_j$ is among the $k$ -nearest neighbors of $v_i$ . The resulting graph is what is usually called the $k$ -nearest neighbor graph.
- The second choice is to connect vertices $v_i$ and $v_j$ if both $v_i$ is among the $k$ -nearest neighbors of $v_j$ and $v_j$ is among the $k$ -nearest neighbors of $v_i$ . The resulting graph is called the mutual $k$ -nearest neighbor graph.
In both cases, after connecting the appropriate vertices we weight the edges by the similarity of their endpoints.
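Both variants can be sketched from a symmetric similarity matrix. The example assumes that "nearest" means "most similar", that ties are broken by vertex index, and that a vertex is never its own neighbor. None of these choices is prescribed by the text, and the names `knn_sets` and `knn_graph` are hypothetical.

```python
def knn_sets(S, k):
    """Return, for each vertex i, the set of its k most similar other vertices."""
    n = len(S)
    neighbors = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Most similar first; the stable sort keeps index order on ties.
        others.sort(key=lambda j: S[i][j], reverse=True)
        neighbors.append(set(others[:k]))
    return neighbors


def knn_graph(S, k, mutual=False):
    """Return the weighted adjacency matrix of the (mutual) k-NN graph.

    With mutual=False, i and j are connected if either one is among the
    k nearest neighbors of the other. With mutual=True, both must hold.
    Connected pairs are weighted by their similarity S[i][j].
    """
    n = len(S)
    nbrs = knn_sets(S, k)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            i_to_j = j in nbrs[i]
            j_to_i = i in nbrs[j]
            connect = (i_to_j and j_to_i) if mutual else (i_to_j or j_to_i)
            if connect:
                W[i][j] = W[j][i] = S[i][j]
    return W


# Illustrative similarities: vertex 2 picks vertex 1, but vertex 1 picks vertex 0.
S = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.5],
    [0.2, 0.5, 1.0],
]
W_knn = knn_graph(S, k=1)
W_mutual = knn_graph(S, k=1, mutual=True)
print(W_knn)
print(W_mutual)
```

In this toy example the edge between vertices 1 and 2 appears in the $k$ -nearest neighbor graph but not in the mutual one, because the neighborhood relation holds in only one direction.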

The fully connected graph: Here we simply connect all points with positive similarity with each other, and we weight all edges by $s_{ij}$ . As the graph should represent the local neighborhood relationships, this construction is only useful if the similarity function itself models local neighborhoods. An example for such a similarity function is the Gaussian similarity function
$s(x_i,x_j) = \exp\left(- \frac{\|x_i-x_j\|^2}{2\sigma^2}\right)$
where the parameter $\sigma$ controls the width of the neighborhoods. This parameter plays a similar role as the parameter $\epsilon$ in case of the $\epsilon$ -neighborhood graph.
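A short sketch of the fully connected graph with Gaussian similarity, assuming points are coordinate tuples. Leaving the diagonal at zero (no self-loops) is a choice made for this example, not something the text specifies, and the function names are hypothetical.

```python
import math


def gaussian_similarity(x, y, sigma):
    """Return exp(-||x - y||^2 / (2 * sigma^2)) for two coordinate tuples."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def fully_connected_graph(points, sigma):
    """Return the weighted adjacency matrix with Gaussian similarity weights.

    Assumption for this sketch: the diagonal stays 0, i.e. no self-loops.
    """
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            W[i][j] = W[j][i] = gaussian_similarity(points[i], points[j], sigma)
    return W


# Illustrative data: the third point is far from the first two.
points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
W_narrow = fully_connected_graph(points, sigma=0.5)
W_wide = fully_connected_graph(points, sigma=2.0)
print(W_narrow[0][2], W_wide[0][2])
```

Comparing the two matrices shows the role of the width parameter: with a small value, distant points receive weights close to zero, while a larger value keeps them noticeably connected.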

All graphs mentioned above are regularly used in spectral clustering. To our knowledge, theoretical results on the question of how the choice of the similarity graph influences the spectral clustering result do not exist.
