Here we explore the use of sneezy on the 10x genomics PBMC3k single-cell dataset. We follow the preprocessing and quality control steps from Chapter 25 of Orchestrating Single Cell Analysis with R, and use that to explore the linear and non-linear embedding spaces via tours and diagnostics.

Setup

First, we load up the raw data from the TENxPBMCData package, and the analysis packages scater and scran:

library(TENxPBMCData)
#> Loading required package: SingleCellExperiment
#> Loading required package: SummarizedExperiment
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> Loading required package: parallel
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:parallel':
#> 
#>     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
#>     clusterExport, clusterMap, parApply, parCapply, parLapply,
#>     parLapplyLB, parRapply, parSapply, parSapplyLB
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
#>     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
#>     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
#>     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
#>     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
#>     union, unique, unsplit, which, which.max, which.min
#> Loading required package: S4Vectors
#> 
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:base':
#> 
#>     expand.grid
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: Biobase
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
#> Loading required package: DelayedArray
#> Loading required package: matrixStats
#> 
#> Attaching package: 'matrixStats'
#> The following objects are masked from 'package:Biobase':
#> 
#>     anyMissing, rowMedians
#> Loading required package: BiocParallel
#> 
#> Attaching package: 'DelayedArray'
#> The following objects are masked from 'package:matrixStats':
#> 
#>     colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
#> The following objects are masked from 'package:base':
#> 
#>     aperm, apply, rowsum
#> Loading required package: HDF5Array
#> Loading required package: rhdf5
library(scater)
#> Loading required package: ggplot2
library(scran)

pbmc3k <- TENxPBMCData("pbmc3k")
#> Using temporary cache /tmp/RtmpfExBWY/BiocFileCache
#> snapshotDate(): 2019-10-22
#> Using temporary cache /tmp/RtmpfExBWY/BiocFileCache
#> Using temporary cache /tmp/RtmpfExBWY/BiocFileCache
#> see ?TENxPBMCData and browseVignettes('TENxPBMCData') for documentation
#> Using temporary cache /tmp/RtmpfExBWY/BiocFileCache
#> Using temporary cache /tmp/RtmpfExBWY/BiocFileCache
#> downloading 1 resources
#> retrieving 1 resource
#> Using temporary cache /tmp/RtmpfExBWY/BiocFileCache
#> loading from cache
#> Using temporary cache /tmp/RtmpfExBWY/BiocFileCache
#> Using temporary cache /tmp/RtmpfExBWY/BiocFileCache
#> Using temporary cache /tmp/RtmpfExBWY/BiocFileCache
pbmc3k
#> class: SingleCellExperiment 
#> dim: 32738 2700 
#> metadata(0):
#> assays(1): counts
#> rownames(32738): ENSG00000243485 ENSG00000237613 ... ENSG00000215616
#>   ENSG00000215611
#> rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
#> colnames: NULL
#> colData names(11): Sample Barcode ... Individual Date_published
#> reducedDimNames(0):
#> spikeNames(0):
#> altExpNames(0):

Next we follow the simple QC steps from the aforementioned vignette:

Quality control

is.mito <- grep("MT", rowData(pbmc3k)$Symbol_TENx)
stats <- perCellQCMetrics(pbmc3k, subsets=list(Mito=is.mito))
high.mito <- isOutlier(stats$subsets_Mito_percent, nmads=3, type="higher")
pbmc3k <- pbmc3k[,!high.mito]

Normalization

Variance modelling

The TourExperiment object and a first embedding

We provide a simple interface to constructing a TourExperiment from a SingleCellExperiment object. Once we have this object, we can construct linear embedding using PCA. Since, the underlying object is backed by an HDF5 file we use randomized PCA for speed purposes.

The arguments above add an embedding to the pbmc3k_te object, using randomized PCA, only on the highly variable genes we’ve constructed above. Since, the procedure is random, we also set a seed.

The usual approach here is to inspect the results of a cluster analysis on top of an embedding, here we have used the standard

g <- buildSNNGraph(pbmc3k_te, k=15, use.dimred = 'pca_random')
clust <- igraph::cluster_louvain(g)
pbmc3k_te$cluster <- factor(igraph::membership(clust))

We can also estimate a basis set via a grand tour:

This creates an array with dimensions 10, 2, 10 named pca_random. To access the underlying array use basisSet(pbmc3k_te).

We can then view the principal components using the view_tour_xy(), this is a shiny application. You can view an example of it here:

sneezy_tour(pbmc3k_te)