Hubness in Real Data

To extend these results to real data, we need to take into account two additional factors:

  • (1) real data sets usually contain dependent attributes, and
  • (2) real data sets are usually clustered, that is, points are organized into groups produced by a mixture of distributions instead of originating from a single (unimodal) distribution.

Dependent attributes

To examine the first factor (dependent attributes), we adopt the approach used in the context of distance concentration.

For each data set, we randomly permute the elements within every attribute. This way, attributes preserve their individual distributions, but the dependencies between them are lost, and the intrinsic dimensionality of the data set increases, becoming equal to its embedding dimensionality.

In Table 1 (10th column) we give the empirical skewness of the shuffled data. For the vast majority of high-dimensional data sets, it is considerably higher than the skewness of the original data, indicating that hubness actually depends on the intrinsic rather than the embedding dimensionality. This provides an explanation for the apparently weaker influence of embedding dimensionality on hubness in real data than in synthetic data sets.
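
The shuffling experiment described above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind Table 1. It assumes Euclidean distance, an illustrative neighborhood size k = 10, skewness computed as the standardized third moment of the k-occurrence counts, and a small synthetic data set with dependent attributes. The helper names (`shuffle_attributes`, `k_occurrences`, `skewness`, `demo`) are hypothetical.

```python
import numpy as np


def shuffle_attributes(X, rng):
    """Independently permute the values within each attribute (column).

    Each column keeps its marginal distribution, but the dependencies
    between columns are destroyed.
    """
    X_shuffled = np.empty_like(X)
    for j in range(X.shape[1]):
        X_shuffled[:, j] = rng.permutation(X[:, j])
    return X_shuffled


def k_occurrences(X, k=10):
    """Count how often each point appears among the k nearest neighbors
    of the other points (Euclidean distance; k = 10 is an assumption)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # squared distances
    np.fill_diagonal(d2, np.inf)                      # a point is not its own neighbor
    knn = np.argpartition(d2, k, axis=1)[:, :k]       # indices of the k smallest per row
    return np.bincount(knn.ravel(), minlength=X.shape[0])


def skewness(values):
    """Standardized third moment of a sample."""
    v = np.asarray(values, dtype=float)
    c = v - v.mean()
    return np.mean(c ** 3) / np.mean(c ** 2) ** 1.5


def demo(seed=0):
    """Compare k-occurrence skewness before and after shuffling on
    synthetic data (assumed setup: 3 latent factors mapped to 50 attributes)."""
    rng = np.random.default_rng(seed)
    latent = rng.normal(size=(1000, 3))
    mixing = rng.normal(size=(3, 50))
    X = latent @ mixing + 0.05 * rng.normal(size=(1000, 50))

    s_original = skewness(k_occurrences(X))
    s_shuffled = skewness(k_occurrences(shuffle_attributes(X, rng)))
    print(f"skewness, original data: {s_original:.3f}")
    print(f"skewness, shuffled data: {s_shuffled:.3f}")


if __name__ == "__main__":
    demo()
```

In this synthetic setup the original data has low intrinsic dimensionality despite having 50 attributes, so comparing the two printed values mirrors the comparison made in Table 1. The exact numbers depend on the seed and on the assumed data generator.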
