Similarity graphs

  1. Graph notation

    Let be an undirected graph with vertex set . In the following we assume that the graph is weighted, that is each edge between two vertices and carries a non-negative weight . The weighted adjacency matrix of the graph is the matrix . If this means that the vertices and are not connected by an edge. As is undirected we require . The degree of a vertex is defined as

    .

    Note that, in fact, this sum only runs over all vertices adjacent to , as for all other vertices the weight is 0. The degree matrix D is defined as the diagonal matrix with the degrees on the diagonal. Given a subset of vertices , we denote its complement by . We define the indicator vector as the vector with entries and otherwise. For convenience we introduce the shorthand notation for the set of indices , in particular when dealing with a sum like . For two not necessarily disjoint sets we define

    We consider two different ways of measuring the "size" of a subset :

    Intuitively, measures the size of by its number of vertices, while measures the size of by summing over the weights of all edges attached to vertices in . subset of a graph is connected if any two vertices in can be joined by a path such that all intermediate points also lie in . subset is called a connected component if it is connected and if there are no connections between vertices in and . The nonempty sets and

  2. Different similarity graphs

    There are several popular constructions to transform a given set of data points with pairwise similarities or pairwise distances dij into a graph. When constructing similarity graphs the goal is to model the local neighborhood relationships between the data points.

    The -neighborhood graph: Here we connect all points whose pairwise distances are smaller than . As the distances between all connected points are roughly of the same scale (at most ), weighting the edges would not incorporate more information about the data to the graph. Hence, the -neighborhood graph is usually considered as an unweighted graph.

    -nearest neighbor graphs: Here the goal is to connect vertex with vertex if is among the k-nearest neighbors of . However, this definition leads to a directed graph, as the neighborhood relationship is not symmetric. There are two ways of making this graph undirected.

    • The first way is to simply ignore the directions of the edges, that is we connect and with an undirected edge if is among the k-nearest neighbors of or if is among the k-nearest neighbors of . The resulting graph is what is usually called thek-nearest neighbor graph.
    • The second choice is to connect vertices and if both is among the k-nearest neighbors of and is among the k-nearest neighbors of . The resulting graph is called the mutual k-nearest neighbor graph.

    In both cases, after connecting the appropriate vertices we weight the edges by the similarity of their endpoints.

    The fully connected graph: Here we simply connect all points with positive similarity with each other, and we weight all edges by . As the graph should represent the local neighborhood relationships, this construction is only useful if the similarity function itself models local neighborhoods. An example for such a similarity function is the Gaussian similarity function

    where the parameter controls the width of the neighborhoods. This parameter plays a similar role as the parameter in case of the -neighborhood graph.

    All graphs mentioned above are regularly used in spectral clustering. To our knowledge, theoretical results on the question how the choice of the similarity graph influences the spectral clustering result do not exist.