4.4 Discussion

We have shown how coverage can be represented in the tidy data framework and integrated with experimental metadata and reference annotations. This framework allowed us to build data descriptors that are simple aggregations of aspects of genomic features over the factors of a designed experiment, and to link those descriptors back to their underlying coverage traces.
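
To make the idea of a data descriptor concrete, the following is a minimal, language-agnostic sketch in plain Python. It is not the plyranges or superintronic API. The table layout, the column names (`line`, `feature`, `log_cov`), the toy values, and the threshold are all hypothetical and chosen only for illustration. It averages log-coverage within each combination of gene, experimental factor, and feature type, then flags genes whose intronic descriptor exceeds a threshold.

```python
from collections import defaultdict
from statistics import mean

# Toy long-form ("tidy") coverage table: one row per sample, gene, and feature.
# All values are invented for illustration only.
rows = [
    {"sample": "s1", "line": "A", "gene": "g1", "feature": "exon",   "log_cov": 5.0},
    {"sample": "s2", "line": "A", "gene": "g1", "feature": "exon",   "log_cov": 5.4},
    {"sample": "s1", "line": "A", "gene": "g1", "feature": "intron", "log_cov": 3.5},
    {"sample": "s2", "line": "A", "gene": "g1", "feature": "intron", "log_cov": 3.9},
    {"sample": "s3", "line": "B", "gene": "g1", "feature": "exon",   "log_cov": 5.2},
    {"sample": "s3", "line": "B", "gene": "g1", "feature": "intron", "log_cov": 1.0},
    {"sample": "s1", "line": "A", "gene": "g2", "feature": "exon",   "log_cov": 4.0},
    {"sample": "s1", "line": "A", "gene": "g2", "feature": "intron", "log_cov": 0.5},
]

def describe(rows, keys=("gene", "line", "feature")):
    """Average log-coverage within each combination of the grouping keys."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row["log_cov"])
    return {key: mean(values) for key, values in groups.items()}

def flag_intron_heavy(descriptors, threshold):
    """Return sorted (gene, line) pairs whose mean intron log-coverage exceeds the threshold."""
    return sorted(
        (gene, line)
        for (gene, line, feature), value in descriptors.items()
        if feature == "intron" and value > threshold
    )

descriptors = describe(rows)
candidates = flag_intron_heavy(descriptors, threshold=3.0)
```

Because each descriptor keeps its gene and factor level as a key, a flagged candidate can be joined back to the rows (and hence the coverage trace) it was computed from.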

Our zebrafish workflow shows that our approach, using superintronic and plyranges, can uncover interesting biological signals in a purely data-driven manner. When deriving our selected genes, we did not include additional information that could have been useful, such as sequence motifs for the U12 class of introns, nor did we exploit the experimental design to find differential IR-like profiles. If that were of interest, one could look at the overlaps, as we did in Figure 4.4, or combine our data descriptors with external estimates using limma (Ritchie et al. 2015), as in the index method we proposed in S. Lee et al. (2020). The gene candidates obtained by our thresholds have been validated by the Heath lab using qPCR.

Although the example we have explored concerned finding coverage traces with IR-like events, the workflow of building and then visualising data descriptors could be generalised to other types of omics analyses, such as peak finding in ChIP-seq, and could incorporate more sophisticated methods for identifying thresholds of ‘interesting’ traces. Our approach would also benefit greatly from interactive graphics that dynamically link, say, a gene description to its underlying coverage trace for rapid exploration. This is left for future work.