A Motivating Example

We start with an illustrative experiment that demonstrates how the distribution of $N_k$ changes with varying dimensionality.

  • Let us consider a random data set of 10000 d-dimensional points whose components are independently drawn from the uniform distribution on the range [0, 1].
  • We measure proximity with the following distance functions: Euclidean ($l_2$), a fractional distance ($l_p$ with $0 < p < 1$), and cosine (see the simulation sketch below).

Figure 1(a–c) shows the empirically observed distributions of $N_k$, with k = 5, for (a) d = 3, (b) d = 20, and (c) d = 100. In the same way, Figure 1(d–f) depicts the empirically observed distributions of $N_k$ for points whose components are drawn i.i.d. from the normal distribution.
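The setup above can be reproduced in a few lines. The sketch below is only illustrative: the helper name `k_occurrences`, the fixed random seed, and the use of scikit-learn's `NearestNeighbors` are assumptions on my part, not details taken from the original experiment.

```python
# Illustrative sketch of the setup above (helper name, seed, and library choice are assumptions).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_occurrences(points, k=5, metric="euclidean"):
    """N_k[i] = number of points that list point i among their k nearest neighbors."""
    nn = NearestNeighbors(n_neighbors=k + 1, metric=metric).fit(points)
    knn = nn.kneighbors(points, return_distance=False)[:, 1:]  # drop each point itself
    return np.bincount(knn.ravel(), minlength=len(points))

rng = np.random.default_rng(0)
n, k = 10_000, 5
for d in (3, 20, 100):
    x = rng.uniform(size=(n, d))                 # components i.i.d. uniform on [0, 1]
    n_k = k_occurrences(x, k=k)                  # Euclidean by default; try metric="cosine" too
    print(f"d={d:3d}  mean N_k={n_k.mean():.1f}  max N_k={n_k.max()}")
```

Note that the mean of $N_k$ is always exactly k (every point contributes k neighbor slots), so the interesting quantity is the shape of the distribution, in particular its right tail.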

Description:

  • d=3:
    • For d = 3, the empirical distributions of $N_k$ for the three distance functions (Figure 1(a, d)) are consistent with the binomial distribution. This is expected when k-occurrences are viewed as node in-degrees in the k-nearest-neighbor digraph.
    • For randomly distributed points in low dimensions, the degree distributions of these digraphs closely resemble the degree distribution of the Erdős–Rényi (ER) random graph model, which is binomial, and Poisson in the limit (the reference distribution is written out after this list).
  • d increases:
    • As dimensionality increases, the observed distributions of $N_k$ depart from the random graph model and become increasingly skewed to the right.
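For reference, the distribution implied by the random-graph view is the following standard result (written out here for convenience, not quoted from the text): under a null model in which each of the other $n-1$ points independently places a given point among its k nearest neighbors with probability $p = k/(n-1)$, the in-degree $N_k$ is binomial and tends to a Poisson distribution with mean k as n grows:

$$
P(N_k = i) = \binom{n-1}{i}\, p^{\,i} (1-p)^{\,n-1-i}, \qquad p = \frac{k}{n-1}, \qquad P(N_k = i) \xrightarrow{\,n \to \infty\,} \frac{k^{i} e^{-k}}{i!}.
$$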

Conclusion:

We verified this by fitting the major right portions (that is, the tails) of the observed distributions with the log-normal distribution, which is highly skewed.
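As a rough illustration of such a fit (this is my own sketch, not the authors' exact fitting procedure; the cutoff choice and the use of `scipy.stats.lognorm` are assumptions), one could fit a log-normal to the upper portion of the observed $N_k$ values:

```python
# Rough sketch of fitting a log-normal to the right portion of the N_k values
# (cutoff choice and fitting procedure are assumptions, not the authors' method).
import numpy as np
from scipy import stats

def fit_lognormal_tail(n_k, cutoff_quantile=0.5):
    tail = n_k[n_k >= np.quantile(n_k, cutoff_quantile)].astype(float)
    tail = tail[tail > 0]                        # log-normal support is (0, inf)
    shape, loc, scale = stats.lognorm.fit(tail, floc=0)
    return shape, np.log(scale)                  # sigma and mu of the fitted log-normal

# e.g. with n_k computed as in the first sketch for d = 100:
# sigma, mu = fit_lognormal_tail(n_k)
```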

  • We made similar observations with various k values, distance measures, and data distributions.
    • k values: generally focusing on the common case $k \ll n$, where n is the number of points in a data set
    • distance measures: $l_p$ distances for both $p \ge 1$ and $0 < p < 1$, Bray-Curtis, normalized Euclidean, and Canberra (a sketch with several of these measures follows this list)
  • In virtually all these cases, skewness exists and produces hubs, that is, points with high k-occurrences.
  • One exception visible in Figure 1 is the combination of cosine distance and normally distributed data.
  • In most practical settings, however, such situations are not expected, and a thorough discussion of the necessary conditions for hubness to occur in high dimensions will be given in Section 5.2.
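A compact way to see the same effect across several of the distance measures named above is to compare the skewness of $N_k$ directly. The sketch below uses a smaller sample size only to keep the neighbor search fast, and omits normalized Euclidean for brevity; these choices, like the code itself, are illustrative assumptions rather than the original experimental protocol.

```python
# Skewness of N_k under several distance measures on high-dimensional uniform data
# (smaller n than in the figure, purely for speed; this is an illustrative sketch).
import numpy as np
from scipy.stats import skew
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
x = rng.uniform(size=(2_000, 100))               # i.i.d. uniform, d = 100
k = 5
for metric in ("euclidean", "cosine", "braycurtis", "canberra"):
    knn = (NearestNeighbors(n_neighbors=k + 1, metric=metric)
           .fit(x)
           .kneighbors(x, return_distance=False)[:, 1:])     # drop each point itself
    n_k = np.bincount(knn.ravel(), minlength=len(x))
    print(f"{metric:11s}  skewness of N_k = {skew(n_k):.2f}")
```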