3.1 Introduction

In this workflow, we examine a subset of the RNA-seq and ATAC-seq data from Alasoo et al. (2018), a study that involved treatment of macrophage cell lines from a number of human donors with interferon gamma (IFNg), Salmonella infection, or both treatments combined. Alasoo et al. (2018) examined gene expression and chromatin accessibility in a subset of 86 successfully differentiated induced pluripotent stem cells (iPSC) lines, and compared baseline and response with respect to chromatin accessibility and gene expression at specific quantitative trait loci (QTL). The authors found that many of the stimulus-specific expression QTL were already detectable as chromatin QTL in naive cells, and further hypothesize about the nature and role of transcription factors implicated in the response to stimulus.

We will perform a much simpler analysis than the one found in Alasoo et al. (2018), using their publicly available RNA-seq and ATAC-seq data (ignoring the genotypes). We will examine the effect of IFNg stimulation on gene expression and chromatin accessibility, and look to see if there is an enrichment of differentially accessible (DA) ATAC-seq peaks in the vicinity of differentially expressed (DE) genes. This is plausible, as the transcriptomic response to IFNg stimulation may be mediated through binding of regulatory proteins to accessible regions, and this binding may increase the accessibility of those regions such that it can be detected by ATAC-seq.

An overview of the fluent genomics workflow. First, we import data as a SummarizedExperiment object, which enables interoperability with downstream analysis packages. Then we model our assay data, using the existing Bioconductor packages DESeq2 and limma. We take the results of our models for each assay with respect to their genomic coordinates, and integrate them. First, we compute the overlap between the results of each assay, then aggregate over the combined genomic regions, and finally summarize to compare enrichment for differentially expressed genes to non differentially expressed genes. The final output can be used for downstream visualization or further transformation.

Figure 3.1: An overview of the fluent genomics workflow. First, we import data as a SummarizedExperiment object, which enables interoperability with downstream analysis packages. Then we model our assay data, using the existing Bioconductor packages DESeq2 and limma. We take the results of our models for each assay with respect to their genomic coordinates, and integrate them. First, we compute the overlap between the results of each assay, then aggregate over the combined genomic regions, and finally summarize to compare enrichment for differentially expressed genes to non differentially expressed genes. The final output can be used for downstream visualization or further transformation.

Throughout the workflow (Figure 3.1), we will use existing Bioconductor infrastructure to understand these datasets. In particular, we will emphasize the use of the Bioconductor packages plyranges and tximeta and the SummarizedExperiment class (Figure 3.2). The plyranges package fluently transforms data tied to genomic ranges using operations like shifting, window construction, overlap detection, etc. It is described by Lee, Cook, and Lawrence (2019) and leverages underlying core Bioconductor infrastructure (Lawrence et al. 2013b; Huber et al. 2015b) and the tidyverse design principles Wickham et al. (2019).

A SummarizedExperiment orients bulk assay measurements as matrices where rows correspond to features, and columns correspond to samples. The rows have their own accessor functions which contains additional data about the feature of interest, in this rowRanges() returns a GRanges where rows are the genomic coordinates for each gene measured in the assay, while colData() returns a DataFrame where rows contain information about the samples.

Figure 3.2: A SummarizedExperiment orients bulk assay measurements as matrices where rows correspond to features, and columns correspond to samples. The rows have their own accessor functions which contains additional data about the feature of interest, in this rowRanges() returns a GRanges where rows are the genomic coordinates for each gene measured in the assay, while colData() returns a DataFrame where rows contain information about the samples.

The tximeta package described by Love et al. (2019) is used to read RNA-seq quantification data into R/Bioconductor, such that the transcript ranges and their provenance are automatically attached to the object containing expression values and differential expression results.

3.1.1 Experimental Data

The data used in this workflow is available from two packages: the macrophage Bioconductor ExperimentData package and from the workflow package fluentGenomics (Lee and Love, n.d.).

The macrophage package contains RNA-seq quantification from 24 RNA-seq samples, a subset of the RNA-seq samples generated and analyzed by Alasoo et al. (2018). The paired-end reads were quantified using Salmon (Patro et al. 2017), using the Gencode 29 human reference transcripts (Frankish, GENCODE-consoritum, and Flicek 2018). For more details on quantification, and the exact code used, consult the vignette of the macrophage package. The package also contains the Snakemake file that was used to distribute the Salmon quantification jobs on a cluster (Köster and Rahmann 2012).

The fluentGenomics package contains functionality to download and generate a cached SummarizedExperiment object from the normalized ATAC-seq data provided by Alasoo and Gaffney (2017). This object contains all 145 ATAC-seq samples across all experimental conditions as analyzed by Alasoo et al. (2018). The data can be also be downloaded directly from the Zenodo deposition.

The following code loads the path to the cached data file, or if it is not present, will create the cache and generate a SummarizedExperiment using the the BiocFileCache package (Shepherd and Morgan 2019).

library(fluentGenomics)
path_to_se <- cache_atac_se()

We can then read the cached file and assign it to an object called atac.

atac <- readRDS(path_to_se)

A precise description of how we obtained this SummarizedExperiment object can be found in section 3.2.2.