A Motivating Example

We start with an illustrative experiment that demonstrates how the distribution of $N_k$ changes with varying dimensionality.

  • Let us consider a random data set of 10000 d-dimensional points whose components are independently drawn from the uniform distribution on the range [0, 1].
  • We measure proximity with the following distance functions: Euclidean ($l_2$), a fractional distance ($l_p$ with $0 < p < 1$), and cosine (see the simulation sketch below).

Figure 1(a–c) shows the empirically observed distributions of $N_k$, with k = 5, for (a) d = 3, (b) d = 20, and (c) d = 100. In the same way, Figure 1(d–f) depicts the empirically observed distributions of $N_k$ for points whose components are drawn i.i.d. from the normal distribution.
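The setup above can be reproduced in a few lines. The sketch below is only illustrative: the helper name `k_occurrences`, the fixed random seed, and the use of scikit-learn's `NearestNeighbors` are assumptions on my part, not details taken from the original experiment.

```python
# Illustrative sketch of the setup above (helper name, seed, and library choice are assumptions).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_occurrences(points, k=5, metric="euclidean"):
    """N_k[i] = number of points that list point i among their k nearest neighbors."""
    nn = NearestNeighbors(n_neighbors=k + 1, metric=metric).fit(points)
    knn = nn.kneighbors(points, return_distance=False)[:, 1:]  # drop each point itself
    return np.bincount(knn.ravel(), minlength=len(points))

rng = np.random.default_rng(0)
n, k = 10_000, 5
for d in (3, 20, 100):
    x = rng.uniform(size=(n, d))                 # components i.i.d. uniform on [0, 1]
    n_k = k_occurrences(x, k=k)                  # Euclidean by default; try metric="cosine" too
    print(f"d={d:3d}  mean N_k={n_k.mean():.1f}  max N_k={n_k.max()}")
```

Note that the mean of $N_k$ is always exactly k (every point contributes k neighbor slots), so the interesting quantity is the shape of the distribution, in particular its right tail.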

Description:

  • d=3:
    • For d = 3, the empirical distributions of $N_k$ for the three distance functions (Figure 1(a, d)) are consistent with the binomial distribution. This is expected when k-occurrences are viewed as node in-degrees in the k-nearest-neighbor digraph.
    • For randomly distributed points in low dimensions, the degree distributions of these digraphs closely resemble the degree distribution of the Erdős–Rényi (ER) random graph model, which is binomial, and Poisson in the limit (the reference distribution is written out after this list).
  • d increases:
    • As dimensionality increases, the observed distributions of $N_k$ depart from the random graph model and become increasingly skewed to the right.
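For reference, the distribution implied by the random-graph view is the following standard result (written out here for convenience, not quoted from the text): under a null model in which each of the other $n-1$ points independently places a given point among its k nearest neighbors with probability $p = k/(n-1)$, the in-degree $N_k$ is binomial and tends to a Poisson distribution with mean k as n grows:

$$
P(N_k = i) = \binom{n-1}{i}\, p^{\,i} (1-p)^{\,n-1-i}, \qquad p = \frac{k}{n-1}, \qquad P(N_k = i) \xrightarrow{\,n \to \infty\,} \frac{k^{i} e^{-k}}{i!}.
$$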

Conclusion:

We verified this by fitting the major right portions (that is, the tails) of the observed distributions with the log-normal distribution, which is highly skewed.
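As a rough illustration of such a fit (this is my own sketch, not the authors' exact fitting procedure; the cutoff choice and the use of `scipy.stats.lognorm` are assumptions), one could fit a log-normal to the upper portion of the observed $N_k$ values:

```python
# Rough sketch of fitting a log-normal to the right portion of the N_k values
# (cutoff choice and fitting procedure are assumptions, not the authors' method).
import numpy as np
from scipy import stats

def fit_lognormal_tail(n_k, cutoff_quantile=0.5):
    tail = n_k[n_k >= np.quantile(n_k, cutoff_quantile)].astype(float)
    tail = tail[tail > 0]                        # log-normal support is (0, inf)
    shape, loc, scale = stats.lognorm.fit(tail, floc=0)
    return shape, np.log(scale)                  # sigma and mu of the fitted log-normal

# e.g. with n_k computed as in the first sketch for d = 100:
# sigma, mu = fit_lognormal_tail(n_k)
```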

  • We made similar observations with various k values, distance measures, and data distributions.
    • k values: generally focusing on the common case $k \ll n$, where n is the number of points in a data set
    • distance measures: $l_p$ distances for both $p \ge 1$ and $0 < p < 1$, Bray-Curtis, normalized Euclidean, and Canberra (a sketch with several of these measures follows this list)
  • In virtually all these cases, skewness exists and produces hubs, that is, points with high k-occurrences.
  • One exception visible in Figure 1 is the combination of cosine distance and normally distributed data.
  • In most practical settings, however, such situations are not expected, and a thorough discussion of the necessary conditions for hubness to occur in high dimensions will be given in Section 5.2.
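A compact way to see the same effect across several of the distance measures named above is to compare the skewness of $N_k$ directly. The sketch below uses a smaller sample size only to keep the neighbor search fast, and omits normalized Euclidean for brevity; these choices, like the code itself, are illustrative assumptions rather than the original experimental protocol.

```python
# Skewness of N_k under several distance measures on high-dimensional uniform data
# (smaller n than in the figure, purely for speed; this is an illustrative sketch).
import numpy as np
from scipy.stats import skew
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
x = rng.uniform(size=(2_000, 100))               # i.i.d. uniform, d = 100
k = 5
for metric in ("euclidean", "cosine", "braycurtis", "canberra"):
    knn = (NearestNeighbors(n_neighbors=k + 1, metric=metric)
           .fit(x)
           .kneighbors(x, return_distance=False)[:, 1:])     # drop each point itself
    n_k = np.bincount(knn.ravel(), minlength=len(x))
    print(f"{metric:11s}  skewness of N_k = {skew(n_k):.2f}")
```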