A Motivating Example
We start with an illustrative experiment which demonstrates the changes in the distribution of with varying dimensionality.
- Let us consider a random data set consisting of 10000 d-dimensional points, whose
- components are independently drawn from the uniform distribution in range [0,1]
- the following distance functions: Euclidean (), fractional , and cosine.
- Figure 1(a–c) shows the empirically observed distributions of , with k = 5, for (a) d = 3, (b) d = 20, and (c) d = 100.
- In the same way, Figure 1(d–f) depicts the empirically observed for points randomly drawn from the i.i.d. normal distribution.
Description:
- d=3:
- For d = 3 the empirical distributions of for the three distance functions (Figure 1(a, d)) are consistent with the binomial distribution. This is expected by considering k-occurrences as node indegrees in the k-nearest neighbor digraph.
- For randomly distributed points in low dimensions, the degree distributions of the digraphs closely resemble the degree distribution of the Erdos-R ˝ enyi (ER) ´ random graph model, which is is binomial and Poisson in the limit .
- d increases:
- As dimensionality increases, the observed distributions of depart from the random graph model and become more skewed to the right.
Conclusion:
We verified this by being able to fit major right portions (that is, tails) of the observed distributions with the log-normal distribution, which is highly skewed.
- We made similar observations with various k values, distance measures, and data distributions.
- k values: generally focusing on the common case << where n is the number of points in a data set
- distance measures: distances for both and , Bray-Curtis, normalized Euclidean, and Canberra
- In virtually all these cases, skewness exists and produces hubs, that is, points with high k-occurrences.
- One exception visible in Figure 1 is the combination of cosine distance and normally distributed data.
In most practical settings, however, such situations are not expected, and a thorough discussion of the necessary conditions for hubness to occur in high dimensions will be given in Section 5.2.