Chapter 6 Conclusion

In this thesis, I have designed tools to explore workflow steps that are integral to modern biological data science. In particular, I have implemented software that facilitates the wrangling, integration, and visualisation of high-throughput biological data in a principled and pragmatic manner. The early chapters of this thesis explored the tidy data semantic and its extension to range-based genomics data. This culminated in Chapter 2, “plyranges: a grammar of genomic data transformation”, which introduced a new domain-specific language for genomics data analysis. The applicability of the plyranges interface and the use of the tidy data concept were further interrogated in Chapter 3, “Fluent genomics with plyranges and tximeta”, which described techniques for integrating data along the genome and emphasised the importance of interoperability between analysis tools. Similarly, Chapter 4, “Exploratory coverage analysis with superintronic and plyranges”, tackled data integration from a different angle by looking at multiple summaries of variables measured along the genome to find putative regions of intron retention. In the final part of the thesis, I moved towards visualisation issues as they relate to working with the high-dimensional data common in biological data science. Chapter 5, “Casting multiple shadows: high-dimensional interactive data visualisation with tours and embeddings”, explored pragmatic approaches to high-dimensional data visualisation in light of the rise of popular non-linear embedding methods.

A significant amount of my work has been devoted to the development of open-source R packages and workflows: plyranges, fluentGenomics, superintronic, and liminal. I have emphasised how coherent software packages are tools for thought; they enable analysts to reason about their data and models through the composition of workflows. To finish, I will discuss the implications of this work and provide suggestions for further research.