3.2 Import Data as a SummarizedExperiment

3.2.1 Using tximeta to import RNA-seq quantification data

First, we specify a directory path, where the quantification files are stored. You could simply specify this directory with:

path <- "/path/to/quant/files"

where the path is relative to your current R session. However, in this case we have distributed the files in the macrophage package. The relevant directory and associated files can be located using system.file().

path <- system.file("extdata", package="macrophage")

Information about the experiment is contained in the coldata.csv file. We leverage the dplyr and readr packages (as part of the tidyverse) to read this file into R (Wickham et al. 2019). We will see later that plyranges extends these packages to accommodate genomic ranges.

library(readr)
library(dplyr)
colfile <- file.path(path, "coldata.csv")
coldata <- read_csv(colfile) %>%
  select(
    names,
    id = sample_id,
    line = line_id,
    condition = condition_name
  ) %>%
  mutate(
    files = file.path(path, "quants", names, "quant.sf.gz"),
    line = factor(line),
    condition = relevel(factor(condition), "naive")
  )
glimpse(coldata)
#> Rows: 24
#> Columns: 5
#> $ names     <chr> "SAMEA103885102", "SAMEA103885347", "SAMEA103885043", "SAME…
#> $ id        <chr> "diku_A", "diku_B", "diku_C", "diku_D", "eiwy_A", "eiwy_B",…
#> $ line      <fct> diku_1, diku_1, diku_1, diku_1, eiwy_1, eiwy_1, eiwy_1, eiw…
#> $ condition <fct> naive, IFNg, SL1344, IFNg_SL1344, naive, IFNg, SL1344, IFNg…
#> $ files     <chr> "/Users/slee0046/thesis/renv/library/R-4.0/x86_64-apple-dar…

After we have read the coldata.csv file, we select relevant columns from this table, create a new column called files, and transform the existing line and condition columns into factors. In the case of condition, we specify the “naive” cell line as the reference level. The files column points to the quantifications for each observation – these files have been gzipped, but would typically not have the ‘gz’ ending if used from Salmon directly. One other thing to note is the use of the pipe operator, %>%, which can be read as “then”, i.e. first read the data, then select columns, then mutate them.

Now we have a table summarizing the experimental design and the locations of the quantifications. The following lines of code do a lot of work for the analyst: importing the RNA-seq quantification (dropping inferential replicates in this case), locating the relevant reference transcriptome, attaching the transcript ranges to the data, and fetching genome information. Inferential replicates are especially useful for performing transcript-level analysis, but here we will use a point estimate for the per-gene counts and perform gene-level analysis. The result is a SummarizedExperiment object.

library(SummarizedExperiment)
library(tximeta)
se <- tximeta(coldata, dropInfReps=TRUE)
se
#> class: RangedSummarizedExperiment 
#> dim: 205870 24 
#> metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
#> assays(3): counts abundance length
#> rownames(205870): ENST00000456328.2 ENST00000450305.2 ...
#>   ENST00000387460.2 ENST00000387461.2
#> rowData names(3): tx_id gene_id tx_name
#> colnames(24): SAMEA103885102 SAMEA103885347 ... SAMEA103885308
#>   SAMEA103884949
#> colData names(4): names id line condition

On a machine with a working internet connection, the above command works without any extra steps, as the tximeta function obtains any necessary metadata via FTP, unless it is already cached locally. The tximeta package can also be used without an internet connection, in this case the linked transcriptome can be created directly from a Salmon index and gtf.

makeLinkedTxome(
  indexDir=file.path(path, "gencode.v29_salmon_0.12.0"),
  source="Gencode",
  organism="Homo sapiens",
  release="29",
  genome="GRCh38",
  fasta="gencode.v29.transcripts.fa.gz", # ftp link to fasta file
  gtf=file.path(path, "gencode.v29.annotation.gtf.gz"), # local version
  write=FALSE
)

Because tximeta knows the correct reference transcriptome, we can ask tximeta to summarize the transcript-level data to the gene level using the methods of Soneson, Love, and Robinson (2015).

gse <- summarizeToGene(se)

One final note is that the start of positive strand genes and the end of negative strand genes is now dictated by the genomic extent of the isoforms of the gene (so the start and end of the reduced GRanges). Another alternative would be to either operate on transcript abundance, and perform differential analysis on transcript, and so avoid defining the transcription start site (TSS) of a set of isoforms, or to use gene-level summarized expression but to pick the most representative TSS based on isoform expression.

3.2.2 Importing ATAC-seq data as a SummarizedExperiment object

The SummarizedExperiment object containing ATAC-seq peaks can be created from the following tab-delimited files from Alasoo and Gaffney (2017):

  • The sample metadata: ATAC_sample_metadata.txt.gz (<1M)
  • The matrix of normalized read counts: ATAC_cqn_matrix.txt.gz (109M)
  • The annotated peaks: ATAC_peak_metadata.txt.gz (5.6M)

To begin, we read in the sample metadata, following similar steps to those we used to generate the coldata table for the RNA-seq experiment:

atac_coldata <- read_tsv("ATAC_sample_metadata.txt.gz") %>%
  select(
    sample_id,
    donor,
    condition = condition_name
  ) %>%
  mutate(condition = relevel(factor(condition), "naive"))

The ATAC-seq counts have already been normalized with cqn (Hansen, Irizarry, and Wu 2012) and \(\log_2\) transformed. Loading the cqn-normalized matrix of \(\log_2\) transformed read counts takes ~30 seconds and loads an object of ~370 Mb. We set the column names so that the first column contains the rownames of the matrix, and the remaining columns are the sample identities from the atac_coldata object.

atac_mat <- read_tsv(
  "ATAC_cqn_matrix.txt.gz",
  skip = 1,
  col_names = c("rownames", atac_coldata[["sample_id"]])
 )
rownames <- atac_mat[["rownames"]]
atac_mat <- as.matrix(atac_mat[,-1])
rownames(atac_mat) <- rownames

We read in the peak metadata (locations in the genome), and convert it to a GRanges object. The as_granges() function automatically converts the data.frame into a GRanges object. From that result, we extract the peak_id column and set the genome information to the build “GRCh38”. We know this from the Zenodo entry.

library(plyranges)
peaks_df <- read_tsv(
  "ATAC_peak_metadata.txt.gz",
  col_types = c("cidciicdc")
  )

peaks_gr <- peaks_df %>%
  as_granges(seqnames = chr) %>%
  select(peak_id=gene_id) %>%
  set_genome_info(genome = "GRCh38")

Finally, we construct a SummarizedExperiment object. We place the matrix into the assays slot as a named list, the annotated peaks into the row-wise ranges slot, and the sample metadata into the column-wise data slot:

atac <- SummarizedExperiment(
  assays = list(cqndata=atac_mat),
  rowRanges = peaks_gr,
  colData = atac_coldata
)