Abstract

Exploratory data analysis is vital to modern science workflows; it allows scientists to grasp problems with their data and generate new hypotheses. This work explores three facets of exploratory data analysis workflows as applied to biological data science: data wrangling, integration and visualisation. It contributes new statistical computing interfaces and frameworks with the explicit aim of enabling scientists to understand their data and models in their biological context. In chapter 2 we show that genomics data can be represented using tidy data semantics, and consequently the process of wrangling it can be simplified via our grammar of genomic data transformation. The next contribution is exploring the implications of our grammar on the integration and representation of genomics data. In chapter 3, we provide a framework for integrating genomics data from multiple assays, via combining model estimates over their genomic regions. Next we extend our grammar to represent single variable measurements along the genome in multiple ways; in chapter 4 we present a software tool that allows coverage scores to be aggregated and visualised over an experimental design and genomic features and use this tool to uncover intron signal in RNA-seq data. Finally, in chapter 5 we contribute a new visualisation interface that provides scientists with a toolkit for discovering structure in their high dimensional data, and assist them in understanding when non-linear dimension reduction has worked appropriately.

–>