Related Work | K-nearest neighbour

Approaches for visualizing collections of documents exist and can be applied to large data matrices. However, in these works, $n$ was less than 30,000, whereas our collection contains ten times more documents. In [5], more than 500,000 documents were processed, but with the help of a hierarchical thesaurus, which is not available for the data we consider.

In this work, the data matrix we must deal with has about $n = 300{,}000$ rows (i.e., data items) and $m = 800$ columns (i.e., dimensions), which amounts to 240 million values. Furthermore, we need fast interactions. These constraints imply that the visualization must be computed in a few seconds at most, since longer response times would slow the user down during the exploration process. In our previous work we were able to build a proximity graph with half of the French ODS (about 150,000 ODS), as represented in Figure 1. This computation took 11 hours, even with intensive computing on GPUs. Once computed and displayed, the graph revealed several clusters and other information [2]. However, only half of the collection was visualized, and even under these conditions, interactions in the graph visualization software were too slow (for instance, a simple zoom/pan took several seconds). More importantly, if the user wanted to change the dimensions (for instance, simply ignoring one of them), the graph had to be recomputed entirely, which took too long for interactive exploration.
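The orders of magnitude involved can be checked with a few lines of arithmetic. The sketch below is only illustrative: the 4-byte (single-precision) storage per value is an assumption, not a detail given in the text, and the pair count simply shows how many item-to-item comparisons an exhaustive proximity computation over all $n$ items would imply.

```python
# Back-of-the-envelope sizes for the data matrix described above.
n = 300_000   # rows (data items), from the text
m = 800       # columns (dimensions), from the text

values = n * m                    # number of matrix entries
bytes_per_value = 4               # ASSUMPTION: float32 storage
matrix_gb = values * bytes_per_value / 1e9

# Number of unordered item pairs, i.e. what an exhaustive
# pairwise comparison of all items would have to examine.
pairs = n * (n - 1) // 2

print(f"{values:,} values, about {matrix_gb:.2f} GB as float32")
print(f"{pairs:,} unordered pairs of items")
```

The matrix itself fits comfortably in memory under this assumption; it is the number of pairs, which grows quadratically with $n$, that makes exhaustive pairwise processing costly.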

We therefore concentrated our attention on projection techniques that map multi-dimensional data to a low-dimensional space (typically 2D). Numerous such methods exist, ranging from Principal Component Analysis (PCA) to MultiDimensional Scaling (MDS). However, given the constraint of interactively exploring a large data matrix, many of these methods are not suited to our problem. In particular, sequential approaches are too slow for our dataset. We therefore focused on projection methods that use intensive, parallel computing to create visualizations, an approach which is still rare. Glimmer is a very good example of such approaches. However, its complexity in $n$ is too high ($\Theta(n^2)$). Other projection methods have linear complexity, which seems more appropriate for our dataset. FastMap has been implemented on GPUs. RadViz is another important example [10] [11] in which the visualization can be modified and improved. This is why we studied these radial approaches.
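To make the linear cost of a radial projection concrete, here is a minimal NumPy sketch of the classic RadViz mapping. It is an illustration under stated assumptions, not the implementation studied in this work: the function name `radviz`, the per-column min-max normalization, and the evenly spaced anchors are choices made for this example. Each dimension is assigned an anchor on the unit circle, and each item is placed at the average of the anchors weighted by its normalized values, which costs $O(n \cdot m)$ operations.

```python
import numpy as np

def radviz(X):
    """Project an (n, m) matrix to 2D with the classic RadViz mapping.

    Illustrative sketch: normalization and anchor placement are
    assumptions of this example, not taken from the paper.
    """
    X = np.asarray(X, dtype=float)
    n, m = X.shape

    # Min-max normalize each dimension to [0, 1]; constant columns map to 0.
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0
    W = (X - lo) / span

    # One anchor per dimension, evenly spaced on the unit circle.
    angles = 2.0 * np.pi * np.arange(m) / m
    anchors = np.column_stack((np.cos(angles), np.sin(angles)))

    # Each item is the weighted average of the anchors.
    # Items whose weights are all zero are placed at the origin.
    total = W.sum(axis=1, keepdims=True)
    total[total == 0] = 1.0
    return (W @ anchors) / total

# Small usage example on random data (hypothetical sizes).
rng = np.random.default_rng(0)
P = radviz(rng.random((1000, 8)))
print(P.shape)
```

Because every item is processed independently of the others, this computation parallelizes naturally, and removing or reordering a dimension only changes the anchor set rather than requiring any pairwise recomputation.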