5.1 Introduction
High-dimensional data is increasingly prevalent in the natural sciences and beyond, but it presents a challenge to the analyst in terms of data cleaning, pre-processing and visualisation. Methods that embed data from a high-dimensional space into a low-dimensional one now form a core step of the data analysis workflow, where they are used to uncover hidden structure and to de-noise data for downstream analysis.
Choosing an appropriate embedding is itself difficult. How does an analyst know whether the embedding has captured the underlying topology and geometry of the high-dimensional space? The answer depends on the analyst’s workflow. Brehmer et al. (2014) characterised two main workflow steps that an analyst performs when using embedding techniques: dimension reduction and cluster orientation. In the first, the analyst wants to characterise and map meaning onto the embedded form, for example by identifying batch effects from a high-throughput sequencing experiment, or by identifying a gradient or trajectory along the embedded form (Nguyen and Holmes 2019). In the second, embeddings are used as part of a clustering workflow: analysts are interested in identifying and naming clusters, and in verifying them either by applying known labels or by colouring by variables that are known a priori to distinguish clusters. Both steps rely on the embedding being representative of the original high-dimensional dataset, and both become much more difficult when there is no underlying ground truth.
As part of a visualisation workflow, it is also important to consider how embeddings are perceived and interpreted. Sedlmair, Munzner, and Tory (2013) showed that scatter plots were mostly sufficient for detecting class separation, but they also noted that multiple embeddings were often required. For the task of cluster identification, Lewis, Van der Maaten, and Sa (2012) showed experimentally that novice users of non-linear embedding techniques were more likely to consider clusters of points on a scatter plot to be the result of a spurious embedding than advanced users who were aware of the inner workings of the embedding algorithm.
A complementary approach for visualising structure in high-dimensional data is the tour. A tour is a sequence of low-dimensional projections of a high-dimensional dataset, each defined by a basis matrix, and is displayed as an animated visualisation (Asimov 1985; Buja and Asimov 1986). Because the tour is dynamic, user interaction is important for controlling and exploring the visualisation. The tour has been used previously by Wickham, Cook, and Hofmann (2015) for exploring statistical model fits and by Buja, Cook, and Swayne (1996) for exploring the space of factorial experimental designs.
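To make the idea of a tour as a sequence of projections concrete, the following is a minimal sketch in Python using only the standard library. It is an illustration under stated assumptions, not the implementation used in any tour software: the function names (`random_basis`, `project`, `tour_frames`) and the toy data are hypothetical, and each frame uses an independently drawn random orthonormal basis. Tour implementations typically also interpolate between successive bases so that the animation is smooth; that step is omitted here.

```python
import math
import random


def random_basis(p, d, rng):
    """Return d orthonormal column vectors of length p.

    Columns are built by Gram-Schmidt on Gaussian random vectors.
    (Hypothetical helper, for illustration only.)
    """
    cols = []
    while len(cols) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(p)]
        # Remove the components along the columns found so far.
        for u in cols:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-12:  # skip the (unlikely) degenerate draw
            cols.append([a / norm for a in v])
    return cols


def project(data, basis):
    """Project each row of data (n x p) onto the basis, giving n x d."""
    return [[sum(x * b for x, b in zip(row, col)) for col in basis]
            for row in data]


def tour_frames(data, d=2, n_frames=5, seed=0):
    """Yield (basis, projected data) pairs, one per animation frame.

    Assumption: frames are independent random bases, with no
    interpolation between them.
    """
    rng = random.Random(seed)
    p = len(data[0])
    for _ in range(n_frames):
        basis = random_basis(p, d, rng)
        yield basis, project(data, basis)


# Toy data: 100 points in 5 dimensions (purely illustrative).
data_rng = random.Random(1)
data = [[data_rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(100)]
frames = list(tour_frames(data, d=2, n_frames=3))
```

Each element of `frames` holds a basis and the corresponding two-dimensional view of the data; drawing these views in sequence as scatter plots gives a crude, non-interpolated version of the animation described above.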
While there has been much work on the algorithmic details of embedding methods, there are relatively few tools designed to help users interact with these techniques and judge when an embedding is sufficient for the task at hand. Several interactive interfaces have been proposed for evaluating or using embedding techniques. Buja et al. (2008) used tours to guide analysts during the optimisation of multidimensional scaling (MDS) methods by extending their interactive visualisation software, XGobi and GGobi, into a new tool called GGvis (Swayne, Cook, and Buja 1998; Swayne et al. 2003; Swayne and Buja 2004). Their interface allows the analyst to dynamically modify an MDS configuration and to check whether it has preserved the locality and closeness of points in the original data. Ovchinnikova and Anders (2020) created the Sleepwalk interface for checking non-linear embeddings of single-cell RNA-seq data. It provides a click-and-highlight visualisation that colours points in an embedding according to an estimated pairwise distance in the original high-dimensional space. Similarly, the TensorFlow embedding projector is a web interface for running some non-linear embedding methods live in the browser, and it provides interactions to colour points and select nearest neighbours (Smilkov et al. 2016). Finally, the work by Pezzotti et al. (2017) provides a user-guided, modified form of the t-SNE algorithm that allows users to adjust optimisation parameters in real time.
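The general idea of colouring an embedding by distances measured in the original space can be sketched as follows. This is a minimal illustration, not the method of Sleepwalk or any other tool named above: it assumes plain Euclidean distance, and the function names and toy data are hypothetical.

```python
import math


def distances_from(data, focus):
    """Euclidean distance from row `focus` to every row of the original data."""
    ref = data[focus]
    return [math.dist(ref, row) for row in data]


def to_unit_scale(values):
    """Rescale values to [0, 1] so they can drive a colour map."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [0.0 if span == 0 else (v - lo) / span for v in values]


# Hypothetical toy data: four points in three dimensions.
data = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [3.0, 4.0, 0.0]]

# Shade for each point when the first point is selected; these values would
# be mapped to colours on the embedding's scatter plot.
shade = to_unit_scale(distances_from(data, focus=0))
```

Points that sit close together in the embedding but receive very different shades are candidates for distortion, which is the kind of check such interfaces are designed to support.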
No single approach fits every case: finding an appropriate embedding for a given dataset is a difficult and somewhat poorly defined problem. For non-linear methods, there are many parameters to explore that can affect the resulting visualisation and its interpretation. Interfaces for evaluating embeddings require interaction but should also fit into an analyst’s workflow. We therefore take a pragmatic approach, combining interactive graphics and tours with embeddings so that users can see a global overview of their high-dimensional data and are assisted with cluster orientation tasks.
The rest of the paper is organised as follows. The next section provides background on dimension reduction methods, including an overview of the tour. Then we describe the visual design of liminal, followed by implementation details. Next we provide case studies that show how our interface assists in using embedding methods. Finally, we describe the insights gained by using liminal and plans for extensions to the software.