Mechanisms Behind Hubness
To illustrate, consider the Euclidean distances of points drawn from a d-dimensional i.i.d. normal distribution to the origin. By definition, the distribution of these distances is the Chi distribution with d degrees of freedom (as the square root of the sum of squares of i.i.d. normal variables; Johnson et al., 1994).
In this setting, distance concentration refers to the fact that:
- the standard deviation of the distance distribution is asymptotically constant with respect to increasing d, while the mean of the distance distribution asymptotically behaves like √d.
On the other hand, for ℓp-norm distances with p > 2, the standard deviation would tend to 0. However, for any finite d, existing variation in the values of random coordinates causes some points to become closer to the distribution mean than others.
This happens despite the fact that all distance values, in general, may increase together with d.
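The concentration behavior above can be checked with a short simulation; this is an illustrative sketch, not code from the paper, and the sample sizes are arbitrary:

```python
# A minimal simulation sketch: empirically check distance concentration in the
# i.i.d. standard normal setting. The Euclidean norm of a d-dimensional point
# follows a Chi distribution with d degrees of freedom, so the mean of the
# distances grows like sqrt(d) while the standard deviation stays
# asymptotically constant.
import numpy as np

rng = np.random.default_rng(0)

for d in (3, 30, 300, 1000):
    points = rng.standard_normal((10_000, d))
    dists = np.linalg.norm(points, axis=1)  # distances from the origin
    print(f"d={d:5d}  mean={dists.mean():7.3f}  "
          f"mean/sqrt(d)={dists.mean() / np.sqrt(d):.3f}  std={dists.std():.3f}")
```

The mean/√d ratio approaches 1 while the standard deviation settles near a constant (about 1/√2), so the spread becomes negligible relative to the magnitude of the distances.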
We now examine why points closer to the data mean become hubs in high dimensions.
Within the i.i.d. normal setting, consider two points drawn from the data, but at specific positions with respect to the origin:
- point b_d, which is at the expected distance from the origin,
- point a_d, which is two standard deviations closer.
In light of the above, the distances of a_d and b_d from the origin change with increasing d, and it could be said that the different a_d (and b_d) occupy analogous positions in their respective data spaces with respect to the changing d.
The distances of a_d (and b_d) to all other points are distributed according to noncentral Chi distributions with d degrees of freedom and noncentrality parameter λ equal to the distance of a_d (b_d) from the origin.
Figure 5(a) plots the probability density functions of these distributions for several values of d. It can be seen that, as d increases, the distance distributions of a_d and b_d move away from each other. This tendency is depicted more clearly in Figure 5(b), which plots the difference between the means of the two distributions as a function of d.
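The construction of a_d and b_d and the growing gap between their distance distributions can be sketched numerically. This is an illustrative simulation (sample sizes and the set of dimensionalities are arbitrary choices), using scipy's `chi` and `ncx2` distributions:

```python
# Sketch of the a_d / b_d construction: b_d sits at the expected distance from
# the origin, a_d two standard deviations closer. The distance from a fixed
# point at distance lam from the origin to a random i.i.d. standard normal
# point follows a noncentral Chi distribution with d degrees of freedom and
# noncentrality lam; we sample it as the square root of a noncentral
# chi-square variate with nc = lam**2.
import numpy as np
from scipy.stats import chi, ncx2

rng = np.random.default_rng(0)
n = 100_000
diffs = []

for d in (3, 20, 100, 500):
    lam_b = chi.mean(d)             # expected distance from the origin
    lam_a = lam_b - 2 * chi.std(d)  # two standard deviations closer
    mean_a = np.sqrt(ncx2.rvs(d, lam_a**2, size=n, random_state=rng)).mean()
    mean_b = np.sqrt(ncx2.rvs(d, lam_b**2, size=n, random_state=rng)).mean()
    diffs.append(mean_b - mean_a)
    print(f"d={d:4d}  mean dist from b_d minus mean dist from a_d = {diffs[-1]:.3f}")
```

The printed difference is positive and grows with d, reproducing the shape of the curve in Figure 5(b).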
It is natural for points that are closer to the mean of the data distribution to be closer, on average, to all other points, for any value of d.
However, the above analysis indicates that this tendency is amplified by high dimensionality, making points that reside in the proximity of the data mean closer (in relative terms) to all other points than their low-dimensional analogues are.
This tendency causes high-dimensional points that are closer to the mean to have increased inclusion probability into k-NN lists of other points, even for small values of k.
In terms of the notion of node centrality typically used in network analysis, the above discussion indicates that high dimensionality amplifies what we will call the spatial centrality of a point (by increasing its proximity to other points), which, in turn, affects the degree centrality of the corresponding node in the k-NN graph (by increasing its degree, that is, its number of k-occurrences N_k).
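The link from spatial centrality to degree centrality in the k-NN graph can also be illustrated directly. Again, this is an illustrative sketch rather than the paper's code; n, d, and k are arbitrary choices:

```python
# Sketch: for i.i.d. normal data, count the k-occurrences N_k of each point
# (how many k-NN lists of other points it appears in) and correlate N_k with
# the point's distance from the empirical data mean. In high dimensions the
# correlation is expected to be clearly negative: points near the mean become
# hubs.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
n, d, k = 2000, 100, 10

X = rng.standard_normal((n, d))
D = cdist(X, X)               # pairwise Euclidean distances
np.fill_diagonal(D, np.inf)   # exclude self-neighbours

knn = np.argsort(D, axis=1)[:, :k]            # k-NN list of every point
N_k = np.bincount(knn.ravel(), minlength=n)   # k-occurrences of every point

dist_to_mean = np.linalg.norm(X - X.mean(axis=0), axis=1)
r = np.corrcoef(dist_to_mean, N_k)[0, 1]
print(f"corr(distance to data mean, N_k) = {r:.3f}")
```

A clearly negative correlation confirms that the increase in spatial centrality translates into an increased node degree in the k-NN graph.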
The main statement of the theorem is given by Equation 1, which expresses the tendency of the difference between the means of the two distance distributions to increase with increasing dimensionality d. It is important to note that this tendency is obtained through analysis of distributions of data and distances, implying that the behavior is an inherent property of data distributions in high-dimensional space, rather than an artefact of other factors, such as finite sample size, etc. Through simulation involving randomly generated points we verified the behavior for i.i.d. normal data by replicating very closely the chart shown in Figure 5(b). Furthermore, simulations suggest that the same behavior emerges in i.i.d. uniform data, as well as numerous other unimodal random data distributions, producing charts of the same shape as in Figure 5(b). Real data, on the other hand,