INFLUENCE ON CLASSIFICATION ALGORITHMS

k-nearest neighbor classifier

The k-nearest neighbor classifier is negatively affected by the presence of "bad" hubs, because they provide erroneous class information to many other points.

To validate this assumption, we devised a simple weighting scheme. For each point $x$, we compute its standardized "bad" hubness score:

$$h_B(x, k) = \frac{BN_k(x) - \mu_{BN_k}}{\sigma_{BN_k}},$$

where $\mu_{BN_k}$ and $\sigma_{BN_k}$ are the mean and standard deviation of $BN_k$, respectively.
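As a concrete illustration, here is a minimal Python sketch of this computation (the function name, the brute-force distance matrix, and the zero-variance guard are our assumptions, not details taken from the experiments):

```python
import numpy as np

def standardized_bad_hubness(X, y, k):
    """Return h_B(x, k) for every row x of X, given class labels y.

    BN_k(x) counts the points that have x among their k nearest
    neighbors yet carry a different class label, i.e. the points to
    which x provides erroneous class information.
    """
    n = X.shape[0]
    # Brute-force pairwise Euclidean distances; a point is never
    # treated as its own neighbor.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)

    bn = np.zeros(n)
    for i in range(n):
        for j in np.argsort(dist[i])[:k]:  # k nearest neighbors of i
            if y[j] != y[i]:               # j gives i erroneous class info
                bn[j] += 1                 # one "bad" occurrence for j

    sigma = bn.std()
    return (bn - bn.mean()) / sigma if sigma > 0 else np.zeros(n)
```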

During majority voting in the k-NN classification phase, when point $x$ participates in the k-NN list of the point being classified, the vote of $x$ is weighted by

$$w(x) = e^{-h_B(x, k)},$$

thus lowering the influence of "bad" hubs on the classification decision.
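The voting step itself can be sketched as follows (again an illustrative Python fragment with hypothetical names, building on the function above): each of the query's k nearest training points votes for its own label with weight $e^{-h_B(x, k)}$, so points with average "bad" hubness vote with weight close to 1, while "bad" hubs are exponentially damped.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, h_b, x_query, k):
    """Weighted k-NN vote: neighbor j adds exp(-h_b[j]) to the tally
    of its class, where h_b holds the standardized "bad" hubness
    scores computed on the training set."""
    dist = np.linalg.norm(X_train - x_query, axis=1)
    tally = {}
    for j in np.argsort(dist)[:k]:
        tally[y_train[j]] = tally.get(y_train[j], 0.0) + np.exp(-h_b[j])
    return max(tally, key=tally.get)
```

Passing an all-zero h_b recovers the unweighted classifier, since every vote then carries weight $e^0 = 1$.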

Figure 11 compares the resulting accuracy of the k-NN classifier with and without this weighting scheme for six data sets from Table 1.

  • Leave-one-out cross-validation is performed, with Euclidean distance used to determine nearest neighbors.
  • The k value for $h_B$ is naturally set to the k value used by the k-NN classifier, and $h_B$ is recomputed for the training set of each fold (a sketch of this procedure follows the list).
  • The reduced accuracy of the unweighted scheme signifies the negative influence of "bad" hubs.
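The evaluation loop can be sketched by combining the two fragments above (a simplified illustration of the stated protocol, not the exact experimental code):

```python
import numpy as np

def loocv_accuracy(X, y, k, weighted=True):
    """Leave-one-out accuracy of the k-NN classifier; h_B is
    recomputed on each fold's training set, and weighted=False
    falls back to plain majority voting."""
    n, correct = len(y), 0
    for i in range(n):
        mask = np.arange(n) != i  # hold out point i
        X_tr, y_tr = X[mask], y[mask]
        h_b = (standardized_bad_hubness(X_tr, y_tr, k)
               if weighted else np.zeros(n - 1))
        correct += weighted_knn_predict(X_tr, y_tr, h_b, X[i], k) == y[i]
    return correct / n
```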

[Figure 11: Accuracy of the k-NN classifier with and without the weighting scheme on six data sets from Table 1.]

Although "bad" hubs tend to carry more information about the location of class boundaries than other points, the "model" created by the k-NN classifier places the emphasis on describing non-borderline regions of the space occupied by each class.

For this reason, it can be said that "bad" hubs are truly bad for k-NN classification, creating the need to penalize their influence on the classification decision. On the other hand, for classifiers that explicitly model the boundaries between classes, such as support vector machines, "bad" hubs can represent points that contribute information to the model in a positive way.