The Hubness Phenomenon

In Section 1 we gave a simple set-based deterministic definition of $N_k$.
To complement this definition and introduce $N_k$ into a probabilistic setting, let $\mathbf x,\mathbf x_1,\ldots,\mathbf x_n$ be $n + 1$ random vectors drawn from the same continuous probability distribution with support $S \subseteq \Bbb R^d$, $d \in \{1,2,\ldots\}$, and let $dist$ be a distance function defined on $\Bbb R^d$ (not necessarily a metric).
Let the functions $p_{i,k}$, where $i,k \in \{1,2,\ldots,n\}$, be defined as

$p_{i,k}(\mathbf x)=\begin{cases} 1, & \text{if } \mathbf x \text{ is among the } k \text{ nearest neighbors of } \mathbf x_i \text{ according to } dist, \\ 0, & \text{otherwise.} \end{cases}$

In this setting, we define $N_k(\mathbf x)=\sum_{i=1}^n p_{i,k}(\mathbf x)$; that is, $N_k(\mathbf x)$ is the random number of vectors among $\mathbf x_1,\ldots,\mathbf x_n$ that have $\mathbf x$ included in their list of $k$ nearest neighbors. In this section we empirically demonstrate the emergence of hubness through the increasing skewness of the distribution of $N_k$ on synthetic and real data, relate the increase in skewness to the dimensionality of the data sets, and motivate the subsequent study of the origins of the phenomenon.
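
As an illustration only, the definition of $N_k$ can be turned into a short computation on a finite sample. The sketch below is not the procedure used in the experiments; it rests on several assumptions of ours: $dist$ is taken to be Euclidean distance, each sample point in turn plays the role of $\mathbf x$ with the remaining points acting as the $\mathbf x_i$, and skewness is measured by the standardized third moment. The names `k_occurrences` and `skewness` are hypothetical helpers introduced here.

```python
import numpy as np


def k_occurrences(X, k):
    """Return N_k for every row of X: the number of other rows that
    list it among their k nearest neighbors (Euclidean distance assumed)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if not 1 <= k <= n - 1:
        raise ValueError("k must satisfy 1 <= k <= n - 1")
    # Pairwise squared Euclidean distances; squaring preserves the ordering.
    sq = np.sum(X ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(D, np.inf)  # a point is never its own neighbor
    # Row i holds the indices of the k nearest neighbors of point i.
    nn = np.argpartition(D, k - 1, axis=1)[:, :k]
    # Count how often each index appears in some neighbor list.
    return np.bincount(nn.ravel(), minlength=n)


def skewness(values):
    """Standardized third moment of a sample (assumed skewness measure)."""
    v = np.asarray(values, dtype=float)
    centered = v - v.mean()
    return np.mean(centered ** 3) / np.std(v) ** 3


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical settings: i.i.d. uniform data, n = 1000, k = 5.
    for d in (3, 20, 100):
        X = rng.uniform(size=(1000, d))
        N_k = k_occurrences(X, k=5)
        print(f"d={d:3d}  mean N_k={N_k.mean():.2f}  skewness={skewness(N_k):.3f}")
```

Since every point contributes exactly $k$ entries to the neighbor lists, the values of $N_k$ always sum to $nk$, so their mean is $k$ regardless of $d$; only the shape of the distribution can change with dimensionality, which is why skewness is the quantity to watch.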