5.2 Overview of Dimension Reduction
In this section, we provide a brief overview of the goals of DR methods from the perspective of the data analyst, together with a high-level account of their mathematical details. We restrict our attention to two recent methods that are commonly used in the literature: t-distributed stochastic neighbour embedding (t-SNE) and uniform manifold approximation and projection (UMAP), although we also mention several other techniques (Maaten and Hinton 2008; McInnes, Healy, and Melville 2018).
To begin, we suppose the data is in the form of a rectangular matrix of real numbers, \(X = [\mathbf{x_1}, \dots, \mathbf{x_n}]\), where \(n\) is the number of observations in \(p\) dimensions. The purpose of any DR algorithm is to find a low-dimensional representation of the data, \(Y = [\mathbf{y_1}, \dots, \mathbf{y_n}]\), such that \(Y\) is an \(n \times d\) matrix where \(d \ll p\). The analyst hopes that the DR procedure used to produce \(Y\) will remove noise from the original dataset while retaining any latent structure.
DR methods can be divided into two broad classes: linear and non-linear. Linear methods perform a linear transformation of the data, that is, \(Y\) is a linear transformation of \(X\); one example is principal components analysis (PCA), which performs an eigendecomposition of the estimated sample covariance matrix (Hotelling 1933). The eigenvalues are sorted in decreasing order and represent the variance explained by each component (eigenvector). A common approach to deciding how many principal components to retain is to plot the proportion of variance explained by each component and choose a cut-off.
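As an illustration, a minimal sketch of this in R, assuming `X` is a numeric data matrix with observations in rows; the centring and scaling choices here are indicative rather than prescriptive:

```r
# Principal components of the (centred and scaled) data
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component; plotting these values
# (a scree plot) is a common way to choose how many components to retain
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
plot(prop_var, type = "b", xlab = "Component", ylab = "Proportion of variance explained")
```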
For non-linear methods, \(Y\) is generated via a pre-processed form of the input \(X\), such as its \(k\)-nearest neighbour graph or a kernel transformation. Multidimensional scaling (MDS) is a class of DR methods that aims to construct an embedding \(Y\) such that the pair-wise distances (inner products) in \(Y\) approximate the pair-wise distances (inner products) in \(X\) (Torgerson 1952; Kruskal 1964a). There are many variants of MDS, such as non-metric scaling, which replaces distances with ranks (Kruskal 1964b). A related technique is Isomap, which uses a \(k\)-nearest neighbour graph to estimate the pair-wise geodesic distances of points in \(X\) and then uses classical MDS to construct \(Y\) (Silva and Tenenbaum 2003). Other approaches are based on diffusion processes, such as diffusion maps (Coifman et al. 2005). A recent example of this approach is the PHATE algorithm (Moon et al. 2019). Here an affinity matrix is estimated via the pair-wise distance matrix and \(k\)-nearest neighbour graph of \(X\). The algorithm de-noises the estimated distances in high-dimensional space by transforming the affinity matrix into a Markov transition probability matrix and diffusing this matrix over a fixed number of time steps. The diffused probabilities are then transformed once more to construct a distance matrix, and classical MDS is used to generate \(Y\). A general difficulty with using non-linear DR methods for exploratory data analysis is selecting and tuning appropriate parameters. To see the extent of these choices we now examine the underpinnings of t-SNE and UMAP.
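Before turning to those two methods, classical MDS provides a simple point of reference; a minimal sketch in base R, again assuming `X` is a numeric data matrix:

```r
# Classical (metric) MDS: find a two-dimensional configuration whose
# pair-wise distances approximate the Euclidean distances in X
D <- dist(X)
Y <- cmdscale(D, k = 2)
plot(Y, xlab = "MDS 1", ylab = "MDS 2")
```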
The t-SNE algorithm converts pair-wise (Euclidean) distances between points in the high-dimensional space into similarities using a Gaussian kernel, and then estimates a configuration in the low-dimensional embedding space by modelling similarities with a t-distribution with one degree of freedom (Maaten and Hinton 2008). There are several subtleties to the use of the algorithm that are revealed by stepping through its machinery.
To begin, t-SNE transforms pair-wise distances between \(\mathbf{x_i}\) and \(\mathbf{x_j}\) to similarities using a Gaussian kernel:
\[ p_{j|i} = \frac{\exp(-\lVert \mathbf{x_i - x_j} \rVert ^ 2 / 2\sigma_i^2)}{\sum_{k \ne i}\exp(-\lVert \mathbf{x_i - x_k} \rVert ^ 2 / 2\sigma_i^2)} \]
The conditional probabilities are then symmetrised by averaging and normalised to form a joint probability distribution:
\[ p_{ij} = \frac{p_{i|j} + p_{j|i}}{2n} \]
The variance parameter \(\sigma_i^2\) of the Gaussian kernel is controlled indirectly by the analyst, who specifies a single value of perplexity used for all observations:
\[ \text{perplexity}_i = \exp\left(-\log(2) \sum_{j \ne i}p_{j|i}\log_2(p_{j|i})\right) \] As the perplexity increases, so does \(\sigma^2_{i}\); the perplexity itself is bounded above by the number of observations minus one, \(n-1\), which corresponds to \(\sigma^2_{i} \rightarrow \infty\). This essentially turns t-SNE into a nearest neighbours algorithm: \(p_{j|i}\) will be close to zero for all observations that are not in the \(\mathcal{O}(\text{perplexity}_i)\) neighbourhood of the \(i\)th observation (Maaten 2014).
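In practice each \(\sigma_i\) is found numerically so that the perplexity of \(p_{\cdot|i}\) matches the value supplied by the analyst. A minimal sketch of this search in R, using bisection on \(\sigma_i\) for a single observation (implementations such as Rtsne use optimised C++ code rather than this):

```r
# Find sigma_i by bisection so that the perplexity of p_{.|i} matches a target.
# d2_i: squared distances from observation i to every other observation.
find_sigma <- function(d2_i, target_perplexity, n_iter = 50) {
  lo <- 1e-10
  hi <- 1e10
  for (iter in seq_len(n_iter)) {
    sigma <- (lo + hi) / 2
    w <- exp(-d2_i / (2 * sigma^2))        # Gaussian kernel weights
    p <- w / sum(w)                        # conditional probabilities p_{j|i}
    entropy <- -sum(p[p > 0] * log2(p[p > 0]))
    if (2^entropy > target_perplexity) hi <- sigma else lo <- sigma
  }
  (lo + hi) / 2
}
```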
Next, in the target low-dimensional space \(Y\), the pair-wise distances between \(\mathbf{y_i}\) and \(\mathbf{y_j}\) are modelled as a symmetric probability distribution using a t-distribution with one degree of freedom (a Cauchy kernel):
\[ q_{ij} = \frac{w_{ij}}{Z} \\ \text{where } w_{ij} = \frac{1}{ 1 + \lVert \mathbf{y_i - y_j} \rVert ^ 2} \text{ and } Z = \sum_{k \ne l} w_{kl}. \]
The resulting embedding \(Y\) is the one that minimises the Kullback-Leibler divergence between the probability distribution formed from similarities of observations in \(X\), \(\mathcal{P}\), and the one formed from similarities of observations in \(Y\), \(\mathcal{Q}\):
\[ \mathcal{L}(\mathcal{P}, \mathcal{Q}) = \sum_{i \ne j} p_{ij} \log \frac{p_{ij}}{q_{ij}}\]
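Written directly in R, this loss could be evaluated for a candidate embedding as below, assuming `P` is the \(n \times n\) joint probability matrix defined above (with a zero diagonal) and `Y` is an \(n \times d\) matrix of embedding coordinates; this is a naive sketch rather than how implementations compute it:

```r
# Kullback-Leibler divergence between high-dimensional similarities P and
# low-dimensional similarities Q induced by the embedding Y
tsne_loss <- function(P, Y) {
  W <- 1 / (1 + as.matrix(dist(Y))^2)   # Cauchy kernel on pair-wise distances
  diag(W) <- 0                          # exclude i = j terms
  Q <- W / sum(W)                       # q_{ij} = w_{ij} / Z
  keep <- P > 0                         # terms with p_{ij} = 0 contribute nothing
  sum(P[keep] * log(P[keep] / Q[keep]))
}
```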
Re-writing the loss function, up to an additive constant that does not depend on \(Y\), in terms of attractive (left term) and repulsive (right term) forces, we obtain:
\[ \mathcal{L}(\mathcal{P}, \mathcal{Q}) = -\sum_{i \ne j} p_{ij}\log w_{ij} + \log\sum_{i \ne j} w_{ij} \]
Minimising the loss therefore balances two forces: the attractive term is small when the pair-wise distances in \(Y\) are small for pairs with non-zero \(p_{ij}\), that is, when \(\mathbf{x_i}\) and \(\mathbf{x_j}\) are close together in the original space; the repulsive term is small when the pair-wise distances in \(Y\) are large, regardless of the magnitude of the corresponding distances in \(X\).
Taken together, these details reveal the sheer number of decisions that an analyst must make. How does one choose the perplexity? How should the parameters that control the optimisation of the loss function (performed with stochastic gradient descent), such as the number of iterations, the early exaggeration (a multiplier of the attractive force at the beginning of the optimisation), or the learning rate, be selected? It is a known problem that t-SNE can have trouble recovering topology, and that configurations can be highly dependent on how the algorithm is initialised and parameterised (Wattenberg, Viégas, and Johnson 2016; Kobak and Berens 2019; Melville 2020). If the goal is recovering clusters, a recent theoretical contribution by Linderman and Steinerberger (2019) proved that t-SNE can recover spherical and well-separated cluster shapes, and proposed new approaches for tuning the optimisation parameters. However, the cluster sizes and their relative orientation in a t-SNE view can be perceptually misleading, due to the algorithm's emphasis on locality.
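To make these decisions concrete, the sketch below shows how they surface as arguments to the Rtsne package; the values shown are indicative defaults, not recommendations:

```r
library(Rtsne)

fit <- Rtsne(
  X,                          # numeric data matrix
  dims = 2,
  perplexity = 30,            # sets each sigma_i through the entropy of p_{.|i}
  max_iter = 1000,            # number of optimisation iterations
  exaggeration_factor = 12,   # early exaggeration of the attractive forces
  eta = 200,                  # learning rate for the gradient descent
  Y_init = NULL               # random initialisation; a PCA layout can be supplied instead
)
head(fit$Y)                   # the low-dimensional embedding
```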
Another recent method, UMAP, has seen a large rise in popularity (at least in single cell transcriptomics) (McInnes, Healy, and Melville 2018). It is related to LargeVis (Tang et al. 2016) and, like t-SNE, acts on the \(k\)-nearest neighbour graph. Its main differences are that it uses a different cost function (cross entropy), which is optimised using stochastic gradient descent, and a different kernel for similarities in the low-dimensional space. Due to its computational speed it is possible to generate UMAP embeddings in more than three dimensions. It appears to suffer from the same perceptual issues as t-SNE; however, it is claimed to preserve global structure better than t-SNE (Coenen and Pearce 2019).
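The analogous choices appear as arguments in R implementations of UMAP such as the uwot package; a hedged sketch, with indicative parameter values:

```r
library(uwot)

embedding <- umap(
  X,                   # numeric data matrix
  n_neighbors = 15,    # size of the local neighbourhood, loosely analogous to perplexity
  min_dist = 0.1,      # how tightly points are packed in the embedding
  n_components = 2,    # can be set above 3 given UMAP's computational speed
  init = "spectral"    # initialisation of the embedding
)
```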
5.2.1 Tours explore the subspace of \(d\)-dimensional projections
The tour is a visualisation technique grounded in mathematical theory that allows an analyst to ascertain the shape and global structure of a dataset via inspection of the subspace generated by the set of low-dimensional projections (Asimov 1985; Buja and Asimov 1986).
As with other DR techniques, the tour assumes we have a real data matrix \(X\) consisting of \(n\) observations in \(p\) dimensions. First, the tour generates a sequence of \(p \times d\) orthonormal projection matrices (bases) \(A_{t \in \mathbb{N}}\), where \(d\) is typically \(1\) or \(2\). For each pair of successive bases \(A_{t}\) and \(A_{t+1}\), the geodesic path between them is interpolated to form intermediate frames, giving the sense of continuous movement from one basis to another. The tour is then the continuous visualisation of the projections \(Y_{t} = XA_{t}\), that is, the projection of \(X\) onto \(A_{t}\), as the tour path is interpolated between successive bases. A grand tour corresponds to choosing new orthonormal bases at random, allowing a user to ascertain structure by exploring the subspace of \(d\)-dimensional projections. In practice, we first sphere our data via principal components to reduce the dimensionality of \(X\) prior to running the tour. Instead of picking projections at random, a guided tour can be used to generate a sequence of ‘interesting’ projections as quantified by an index function (Cook et al. 1995). While our software, liminal, is able to visualise guided tours, our focus in the case studies is on using the grand tour to see global structure in the data.
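A minimal sketch of this workflow with the tourr package in R (the choice of ten principal components is arbitrary and for illustration only):

```r
library(tourr)

# Reduce dimensionality via principal components before touring
pcs <- prcomp(X, center = TRUE, scale. = TRUE)$x[, 1:10]

# Grand tour: a continuous sequence of randomly chosen 2-d projections
animate_xy(pcs, tour_path = grand_tour(d = 2))

# Guided tour: projections chosen to optimise an index function, e.g. holes()
# animate_xy(pcs, tour_path = guided_tour(holes()))
```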