5.3 Visual Design
Tours provide a supporting visualisation for NLDR graphics, and can be easily incorporated into an analyst's workflow with our software package, liminal. Our interface allows analysts to quickly compare views from embedding methods and see how an embedding method preserves or alters the geometry of their data. Combining the tour with multiple concatenated and linked views enhances interaction techniques, and allows analysts to perform cluster orientation tasks via linked highlighting and brushing (McDonald 1982; Becker and Cleveland 1987). This approach allows our interface to achieve the three principles for interactive high-dimensional data visualisation outlined by Buja, Cook, and Swayne (1996): finding gestalt (identifying patterns in visual forms), posing queries, and making comparisons.
5.3.1 Finding Gestalt: focus and context
To investigate latent structure and the shape of a high-dimensional dataset, a tour can be run without the use of an external embedding. It is often useful to first run principal components on the input as an initial dimension reduction step, and then tour a subset of those components instead, e.g. by selecting them from a scree plot. The default tour layout is a scatter plot with an axis layout displaying the magnitude and direction of each basis vector. Since the tour is dynamic, it is often useful to be able to pause and highlight a particular view. In addition to pause, play and reset buttons, brushing will pause the tour path, allowing users to identify ‘interesting’ projections. The domain of the axis scales from running a tour is called the half range, and is computed by rescaling the input data onto a hyper-dimensional unit cube. We bind the half range to a mouse wheel event, allowing a user to pan and zoom on the tour view dynamically. This is useful for peeling back dense clumps of points to reveal structure.
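The preprocessing described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not liminal's implementation: the function names are hypothetical, the tour frame is a single random orthonormal basis rather than an interpolated tour path, and the half range is taken to be the largest Euclidean norm of the centred, rescaled data, which is one possible convention.

```python
import numpy as np

def pca_scores(X, n_components):
    """Project centred data onto its leading principal components."""
    Xc = X - X.mean(axis=0)
    # SVD of the centred data; the rows of Vt are the principal axes.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)  # proportions one would show in a scree plot
    return Xc @ Vt[:n_components].T, explained

def rescale_unit_cube(X):
    """Rescale each column to [0, 1]; constant columns map to 0."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (X - lo) / span

def random_basis(p, d=2, seed=None):
    """Draw a random p x d orthonormal basis via a QR decomposition."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((p, d)))
    return Q

def half_range(X):
    """Assumed convention: largest Euclidean norm of the centred data."""
    Xc = X - X.mean(axis=0)
    return float(np.sqrt((Xc**2).sum(axis=1)).max())

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))           # synthetic stand-in for real data
scores, explained = pca_scores(X, n_components=5)
cube = rescale_unit_cube(scores)
basis = random_basis(cube.shape[1], d=2, seed=1)
centred = cube - cube.mean(axis=0)
view = centred @ basis / half_range(cube)    # one projected frame of the tour
```

Under this convention every projected point lies within the unit disc, so dividing by a smaller value than `half_range(cube)` zooms in on the centre of the view, which is the effect a mouse wheel binding would produce.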
5.3.2 Posing Queries: multiple views, many contexts
We have combined the tour view in a side-by-side layout with a scatter plot view, as has been done in the earlier tour interfaces XGobi and DataViewer (Buja, Hurley, and McDonald 1986; Swayne, Cook, and Buja 1998). These views are linked; analysts can brush regions or highlight collections of points in either view. Linked highlighting can be performed when points have been previously labelled according to some discrete structure, e.g. when cluster labels are available. The analyst clicks on groups in the legend, which causes the points in unselected groups to become less opaque. Consequently, simple linked highlighting can alleviate a known downfall of methods such as UMAP or t-SNE, namely that distances between clusters are misleading. By highlighting corresponding clusters in the tour view, the analyst can see the relationship between clusters, and therefore obtain a more accurate representation of the topology of their data.
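The highlighting rule can be expressed as a per-point opacity vector shared by both views. The sketch below is a minimal illustration; the function name and the dimmed opacity value of 0.2 are assumptions, not values taken from liminal.

```python
import numpy as np

def highlight_alpha(labels, selected, dim_alpha=0.2):
    """Opacity per point: 1.0 for selected groups, dim_alpha otherwise.

    With no groups selected, every point stays fully opaque.
    """
    labels = np.asarray(labels)
    if len(selected) == 0:
        return np.ones(labels.shape[0])
    keep = np.isin(labels, list(selected))
    return np.where(keep, 1.0, dim_alpha)

labels = np.array(["a", "b", "a", "c"])   # hypothetical cluster labels
alpha = highlight_alpha(labels, {"a"})    # analyst clicks group "a" in the legend
```

Because the tour view and the embedding view share the same row order, the same `alpha` vector can be applied to both, which is what makes the highlighting linked.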
Simple linked brushing is achieved via mouse-click and drag movements. By default, when brushing occurs in the tour view, the current projection is paused and corresponding points in the embedding view are highlighted. Likewise, when brushing occurs in the embedding view, corresponding points in the tour view are highlighted. In this case, an analyst can use brushing to manually identify clusters and to verify cluster locations and shapes: brushing in the embedding view gives analysts a sense of the shape and proximity of clusters in high-dimensional space.
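One-to-one linked brushing amounts to computing a row mask from a rectangle in one view and applying it to the other. The following sketch assumes a rectangular brush and two coordinate arrays with matching row order; the names and toy coordinates are hypothetical.

```python
import numpy as np

def brush_mask(coords, x_range, y_range):
    """Boolean mask of the points inside a rectangular brush.

    The ranges may be given in either order, as happens when the
    analyst drags the mouse up or to the left.
    """
    coords = np.asarray(coords, dtype=float)
    x0, x1 = sorted(x_range)
    y0, y1 = sorted(y_range)
    inside_x = (coords[:, 0] >= x0) & (coords[:, 0] <= x1)
    inside_y = (coords[:, 1] >= y0) & (coords[:, 1] <= y1)
    return inside_x & inside_y

# Toy coordinates: row i in each array refers to the same observation.
embedding = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
tour_view = np.array([[0.1, -0.2], [0.8, 0.9], [0.0, 0.1]])

mask = brush_mask(embedding, x_range=(-0.5, 1.5), y_range=(1.5, -0.5))
linked = tour_view[mask]   # the points to highlight in the other view
```

Brushing in the tour view works the same way with the roles of the two arrays swapped.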
5.3.3 Making Comparisons: revising embeddings
As mentioned previously, when using any DR method, we are assuming the embedding is representative of the high-dimensional dataset it was computed from. Defining what it means for an embedding to be ‘representative’ or ‘faithful’ to high-dimensional data is ill-posed and depends on the underlying task an analyst is trying to achieve. At the very minimum, we are interested in distortions and diffusions of the high-dimensional data. Distortions occur when points that are near each other in the embedding view are far from each other in the original dataset. This implies that the embedding is not continuous. Diffusions occur when points that are far from each other in the embedding view are near each other in the original data. Whether points are near or far depends on the distance metric used; distortions and diffusions can be thought of in terms of how well distances, or nearest-neighbour graphs, are preserved between the high-dimensional space and the embedding space. As distances can be noisy in high dimensions, ranks can be used instead, as proposed by Lee and Verleysen (2009). Identifying distortions and diffusions allows an analyst to investigate the quality of their embedding and revise it iteratively.
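A rank-based version of these two definitions can be sketched numerically. The code below is an illustrative assumption rather than the measure of Lee and Verleysen (2009) or anything computed by liminal: ‘near’ is taken to mean being among the k nearest neighbours under Euclidean distance, ‘far’ to mean having a neighbour rank above a cutoff `far`, and both parameters and all function names are hypothetical.

```python
import numpy as np

def neighbour_ranks(X):
    """ranks[i, j] = position of j among the neighbours of i (1 = nearest).

    Each point has rank 0 with respect to itself.
    """
    X = np.asarray(X, dtype=float)
    sq = (X**2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # squared Euclidean distances
    np.fill_diagonal(d2, -np.inf)                  # force self to sort first
    order = np.argsort(d2, axis=1)
    ranks = np.empty_like(order)
    rows = np.arange(X.shape[0])[:, None]
    ranks[rows, order] = np.arange(X.shape[0])[None, :]
    return ranks

def distortion_diffusion(X_high, X_embed, k=10, far=30):
    """Per-point fraction of near neighbours in one space that are far in the other."""
    r_high = neighbour_ranks(X_high)
    r_emb = neighbour_ranks(X_embed)
    near_high = (r_high >= 1) & (r_high <= k)
    near_emb = (r_emb >= 1) & (r_emb <= k)
    # Distortion: near in the embedding, far in the original data.
    distortion = (near_emb & (r_high > far)).sum(axis=1) / k
    # Diffusion: near in the original data, far in the embedding.
    diffusion = (near_high & (r_emb > far)).sum(axis=1) / k
    return distortion, diffusion

# Toy data: two groups separated only along the third coordinate,
# which the 'embedding' discards, so the groups interleave in 2-d.
i = np.arange(10.0)
group_a = np.column_stack([i, np.zeros(10), np.zeros(10)])
group_b = np.column_stack([i + 0.5, np.zeros(10), np.full(10, 100.0)])
X_high = np.vstack([group_a, group_b])
X_embed = X_high[:, :2]

distortion, diffusion = distortion_diffusion(X_high, X_embed, k=2, far=9)
```

In this toy example the embedding places points from different groups next to each other, so every point has a non-zero distortion score, while no close pairs in the original data are pulled far apart, so the diffusion scores are zero.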
These checks are done visually using our side-by-side tour and embedding views. In the simplest case, local continuity can be assessed via one-to-one linked brushing from the embedding view to the tour view. Similarly, diffusions are identified by linked brushing on the tour view, which highlights the corresponding points in the embedding view.