5.6 Discussion

We have shown that using tours to interact with high-dimensional data provides additional insight when interrogating views generated from embeddings. The interface we have designed in the liminal package allows a user to gain a deeper understanding of an embedding algorithm, and rectifies perceptual issues associated with NLDR methods via linked interactions with the tour. As we have shown in the simulation case studies, the t-SNE method can produce misleading embeddings, which can be detected through the use of linked brushing and highlighting. When the data has a piecewise linear geometry, as in the tree simulation, the tour preserves the shape of the data, which can be obscured by the embedding method.

Our framework can also be useful in practice, as demonstrated in the fourth case study. The tour, when combined with t-SNE, allowed us to identify clusters while giving us an idea of their orientation relative to each other. Moreover, we could visually inspect the separation of clusters using a tour on marker gene sets. We see our approach as being valuable to the single cell analyst who wants to make their embeddings more interpretable.

We have shown in the case studies that one-to-one linked brushing can be used to identify distortions in the embedding; however, we would like to extend this to one-to-many linked brushing, which would allow us to directly interrogate neighbourhood preservation. This form of brushing acts directly on a \(k\)-nearest neighbours (\(k\)-nn) graph computed from a reference dataset: when a user brushes over a region in the embedding, all the points connected to the brushed points by the graph's edges are selected on the corresponding tour view. The reference dataset for computing nearest neighbours (for example a distance matrix, or the complete data matrix) can be independent of the tour or embedding views. Instead of highlighting the neighbouring points, one could use opacity or binned colour scales to encode distances or ranks. We have begun implementing this brush in liminal, using the FNN or RcppAnnoy packages (Beygelzimer et al. 2019; Eddelbuettel 2020) for fast neighbourhood estimation on the server side; however, there are still technicalities that need to be resolved. Brush composition, such as ‘and’, ‘or’, or ‘not’ brushes, could be used to further investigate mismatches between the \(k\)-nn graphs estimated from both the embedding and tour views.
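The one-to-many brush described above can be illustrated with a minimal sketch. It is written in Python with NumPy for illustration only and is not the liminal implementation, which is in R and unfinished. The function names `knn_indices` and `one_to_many_brush` are hypothetical, the neighbour search is a brute-force Euclidean one rather than the FNN or RcppAnnoy routines, and we assume that the brushed points themselves remain selected along with their neighbours.

```python
import numpy as np

def knn_indices(reference, k):
    """Return an (n, k) array holding each row's k nearest neighbours.

    Brute-force Euclidean search on the reference data (an assumption made
    for illustration; a real implementation would use a faster index).
    """
    reference = np.asarray(reference, dtype=float)
    sq = np.sum(reference ** 2, axis=1)
    # Squared pairwise distances via ||a||^2 + ||b||^2 - 2 a.b
    d2 = sq[:, None] + sq[None, :] - 2.0 * reference @ reference.T
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbour
    return np.argsort(d2, axis=1)[:, :k]

def one_to_many_brush(brushed, neighbours):
    """Boolean mask of points to highlight in the linked (tour) view.

    `brushed` holds row indices brushed in the embedding view; the mask
    covers those rows plus every row they reach along a k-nn graph edge.
    """
    brushed = np.asarray(brushed, dtype=int)
    selected = np.zeros(neighbours.shape[0], dtype=bool)
    selected[brushed] = True                      # assumption: keep brushed points
    selected[neighbours[brushed].ravel()] = True  # follow the graph's edges
    return selected

# Hypothetical reference data: five points on a line, forming two groups.
reference = np.array([[0.0], [1.0], [2.5], [10.0], [11.0]])
neighbours = knn_indices(reference, k=1)
mask = one_to_many_brush([3], neighbours)
```

Because the neighbour table is computed once from the reference data, the brush itself reduces to an index lookup, which is what would make it feasible to run on each brush event.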

There are some limitations in using the liminal interface for larger datasets. First, t-SNE avoids the crowding problem: points are separated into distinct regions on the display canvas. In the tour, by contrast, points are concentrated in the centre of the projection and become difficult to see. We have recently proposed a simple non-linear transformation for the tour, called a sage tour, that aims to fix this problem (Laa, Cook, and Lee 2020). Second, as \(n\) increases, both the embedding view and tour view become harder to read due to over-plotting, while the interactivity and animation become slower as more data passes from the server to the client. For the tasks we have looked at in this paper, where shape and density are important to the analyst, we think that better displays and sub-sampling strategies are more useful than being able to look at every single point on the canvas. We showed in our single cell clustering case study that taking a weighted sample based on cluster membership still allowed us to get a sense of relative cluster orientation; however, there are alternative sampling approaches that could be applied, such as selecting points close to the cluster centres. Alternative displays via statistical transformations could also mitigate the need to show all of the data. Recent work by Laa et al. (2020) is a promising area for further investigation, as is work from topological statistics (Rieck 2017; Genovese et al. 2017).
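As a rough illustration of sub-sampling by cluster membership, the sketch below draws a fixed fraction of points from each cluster while guaranteeing that small clusters keep a minimum number of points. This is one possible reading of a weighted sample and is not necessarily the scheme used in the case study. The function name `sample_by_cluster` and the parameters `fraction` and `min_per_cluster` are hypothetical, and Python with NumPy is used for illustration only.

```python
import numpy as np

def sample_by_cluster(labels, fraction, min_per_cluster=1, seed=0):
    """Return sorted row indices of a per-cluster random sub-sample.

    Each cluster contributes roughly `fraction` of its points, but never
    fewer than `min_per_cluster` (or all of its points if it is smaller),
    so that rare clusters stay visible in the reduced display.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    keep = []
    for cluster in np.unique(labels):
        members = np.flatnonzero(labels == cluster)
        size = max(min_per_cluster, int(round(fraction * members.size)))
        size = min(size, members.size)
        keep.append(rng.choice(members, size=size, replace=False))
    return np.sort(np.concatenate(keep))

# Hypothetical cluster labels: one large, one medium, and one tiny cluster.
labels = np.array([0] * 100 + [1] * 10 + [2] * 2)
idx = sample_by_cluster(labels, fraction=0.1, min_per_cluster=2)
```

The minimum per cluster is a design choice: without it, a purely proportional sample could drop a rare cluster entirely, which would defeat the purpose of inspecting relative cluster orientation.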