Hubness in Real Data
As in the previous section, a considerable increase in the skewness of the distributions can be observed with increasing dimensionality.
In all, we examined 50 real data sets from well-known sources, belonging to three categories: UCI multidimensional data, gene expression microarray data, and textual data in the bag-of-words representation, listed in Table 1.
The table includes columns describing the data-set source (2nd column); basic statistics, namely the data transformation (3rd column: whether standardization was applied or, for textual data, which bag-of-words document representation was used), the number of points (n, 4th column), the dimensionality (d, 5th column), and the number of classes (7th column); and the distance measure used (Euclidean or cosine, 8th column).
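- To characterize the asymmetry of $N_k$ we use the standardized third moment of the distribution of k-occurrences,
$$S_{N_k} = \frac{E\left[(N_k - \mu_{N_k})^3\right]}{\sigma_{N_k}^3},$$
where $\mu_{N_k}$ and $\sigma_{N_k}$ are the mean and standard deviation of $N_k$, respectively.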
- The corresponding (9th) column of Table 1, which shows the empirical skewness values $S_{N_k}$ for the real data sets, indicates that the distributions of the k-occurrences $N_k$ for most examined data sets are skewed to the right.
- The value of k is fixed at 10, but analogous observations can be made with other values of k.
- It can be observed that some skewness values $S_{N_k}$ in Table 1 are quite high, indicating strong hubness in the corresponding data sets. Moreover, computing the Spearman correlation between d and $S_{N_k}$ over all 50 data sets reveals it to be strong (0.62), signifying that the relationship between dimensionality and hubness extends from synthetic to real data in general.
- On the other hand, careful scrutiny of the charts in Figure 2 and values in Table 1 reveals that for real data the impact of dimensionality on hubness may not be as strong as could be expected after viewing hubness on synthetic data in Figure 1.
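The skewness measure described above can be illustrated with a minimal sketch. It is not the authors' code: the function names (`k_occurrences`, `skewness`, `random_points`) are hypothetical, the data are uniformly random points rather than any data set from Table 1, and only Euclidean distance is covered. It counts how often each point appears among the k nearest neighbors of the other points and then computes the standardized third moment of those counts.

```python
import math
import random


def k_occurrences(points, k):
    """Return, for each point, how often it appears in other points' k-NN lists."""
    n = len(points)
    counts = [0] * n
    for i in range(n):
        # Euclidean distances from point i to every other point (brute force).
        dists = [(math.dist(points[i], points[j]), j) for j in range(n) if j != i]
        dists.sort()
        for _, j in dists[:k]:
            counts[j] += 1
    return counts


def skewness(values):
    """Standardized third moment E[(x - mu)^3] / sigma^3 (population form)."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n
    if var == 0:
        return 0.0  # a constant sample has no asymmetry
    third = sum((v - mu) ** 3 for v in values) / n
    return third / var ** 1.5


def random_points(n, d, rng):
    """Assumed toy data: n points drawn uniformly from the d-dimensional unit cube."""
    return [[rng.random() for _ in range(d)] for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    for d in (3, 50):
        counts = k_occurrences(random_points(300, d, rng), k=10)
        print(f"d={d}: skewness of k-occurrences = {skewness(counts):.3f}")
```

Because every point contributes exactly k neighbors, the counts always sum to n·k, so their mean equals k; only the shape of the distribution changes with the data. For real data sets, a library routine for nearest-neighbor search would be a more practical choice than this quadratic loop.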