"Bad" hubs
"Bad" hubs, that is, points with high $BN_k$, are of particular interest to supervised learning because they carry more information about the location of the decision boundaries than other points, and affect classification algorithms in different ways.
To understand the origins of "bad" hubs in real data, we rely on the notion of the cluster assumption from semi-supervised learning, which roughly states that most pairs of points in a high density region (cluster) should be of the same class.
To measure the degree to which the cluster assumption is violated in a particular data set, we define the cluster assumption violation (CAV) coefficient as follows. Let $a$ be the number of pairs of points which are in different classes but in the same cluster, and $b$ the number of pairs of points which are in the same class and cluster. Then, we define
$$\mathrm{CAV} = \frac{a}{a + b},$$
which gives a number in the $[0, 1]$ range, higher if there is more violation.
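As an illustration, the coefficient can be computed directly from class labels and cluster assignments. The following is a minimal sketch, not the authors' implementation: the function name `cav`, its arguments, and the convention of returning 0 when no two points share a cluster are assumptions made here. It counts the pairs that share a cluster, subtracts the pairs that also share a class, and divides the remainder (the different-class pairs) by the same-cluster total.

```python
from collections import Counter


def n_pairs(n):
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2


def cav(labels, clusters):
    """Fraction of same-cluster pairs whose two points differ in class.

    labels[i] is the class of point i, clusters[i] its cluster index.
    """
    if len(labels) != len(clusters):
        raise ValueError("labels and clusters must have the same length")
    # Every pair of points that shares a cluster, regardless of class.
    same_cluster = sum(n_pairs(n) for n in Counter(clusters).values())
    # Pairs that share both the cluster and the class.
    same_both = sum(n_pairs(n) for n in Counter(zip(clusters, labels)).values())
    if same_cluster == 0:
        return 0.0  # assumption: no same-cluster pairs means nothing to violate
    # The remaining same-cluster pairs are the different-class ones.
    return (same_cluster - same_both) / same_cluster
```

Counting through group sizes avoids enumerating all pairs explicitly, so the cost is linear in the number of points rather than quadratic.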
To reduce the sensitivity of CAV to the number of clusters (too few and it will be overly pessimistic, too many and it will be overly optimistic), we set the number of clusters to 3 times the number of classes in a particular data set. Clustering is performed with K-means.
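The clustering step might be sketched as follows. This is a plain Lloyd-style K-means in pure Python, given only to make the choice of cluster count concrete. The text does not specify the initialization, iteration limit, or random seed, so those, along with the hypothetical names `kmeans` and `cluster_for_cav`, are assumptions. In practice a library implementation of K-means would normally be used instead.

```python
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns one cluster index per point.

    points is a list of equal-length numeric tuples, with k <= len(points).
    """
    rng = random.Random(seed)
    # Assumption: initial centers are k distinct points drawn at random.
    centers = [list(p) for p in rng.sample(points, k)]
    assign = None
    for _ in range(n_iter):
        # Assign each point to its nearest center (squared Euclidean distance).
        new_assign = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centers[c])))
            for p in points
        ]
        if new_assign == assign:
            break  # assignments are stable, so the algorithm has converged
        assign = new_assign
        # Move each center to the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def cluster_for_cav(points, labels, clusters_per_class=3, seed=0):
    """Cluster with K-means using 3 clusters per class, as in the text."""
    k = clusters_per_class * len(set(labels))
    return kmeans(points, k, seed=seed)
```

The returned assignments, together with the class labels, are the two inputs needed to count the same-cluster pairs that enter the CAV ratio.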