2.3 Discussion

The design of plyranges adheres to well understood principles of language and API design: cognitive consistency, cohesion, endomorphism and expressiveness (T R G Green AND Petre 1996). To varying degrees, these principles also underlie the design of dplyr and the Bioconductor infrastructure.

We have aimed for plyranges to have a simple and direct mapping to the user’s cognitive model, i.e., how the user thinks about the data. This requires careful selection of the level of abstraction so that the user can express workflows in the language of genomics. This motivates the adoption of the tidy GRanges object as our central data structure. The basic data.frame and dplyr tibble lack any notion of genomic ranges and so could not easily support our genomic grammar, with its specific verbs for range-oriented data manipulation. Another example of cognitive consistency is how plyranges is insensitive to direction/strand by default when, e.g., detecting overlaps. GenomicRanges has the opposite behavior. We believe that defaulting to purely spatial overlap is most intuitive to most users.

To further enable cognitive consistency, plyranges functions are cohesive. A function is defined to be cohesive if it performs a singular task without producing any side-effects. Singular tasks can always be broken down further at lower levels of abstraction. For example, to resize a range, the user needs to specify which position (start, end, midpoint) should be invariant over the transformation. The resize() function from the GenomicRanges package has a fix argument that sets the anchor, so calling resize() coalesces anchoring and width modification. The coupling at the function call level is justified since the effect of setting the width depends on the anchor. However, plyranges increases cohesion and decouples the anchoring into its own function call.

Increasing cohesion simplifies the interface to each operation, makes the meaning of arguments more intuitive, and relies on function names as the primary means of expression, instead of a more complex mixture of function and argument names. This results in the user being able to conceptualize the plyranges DSL as a flat catalog of functions, without having to descend further into documentation to understand a function’s arguments. A flat function catalog also enhances API discoverability, particularly through auto-completion in integrated developer environments (IDEs). One downside of pushing cohesion to this extreme is that function calls become coupled, and care is necessary to treat them as a group when modifying code.

Like dplyr, plyranges verbs are functional: they are free of side effects and are generally endomorphic, meaning that when the input is a GRanges object they return a GRanges object. This enables chaining of verbs through syntax like the forward pipe operator from the magrittr package. This syntax has a direct cognitive mapping to natural language and the intuitive notion of pipelines. The low-level object-oriented APIs of Bioconductor tend to manipulate data via sub-replacement functions, like start(gr) <- x. These ultimately produce the side effect of replacing a symbol mapping in the current environment and thus are not amenable to so-called fluent syntax.

Expressiveness relates to the information content in code: the programmer should be able to clarify intent without unnecessary verbosity. For example, our overlap-based join operations are more concise than the multiple steps necessary to achieve the same effect in the original GenomicRanges API. In other cases, the plyranges API increases verbosity for the sake of clarity and cohesion. Explicitly calling anchor() can require more typing, but the code is easier to comprehend. Another example is the set of routines for importing genomic annotations, including read_gff(), read_bed(), and read_bam(). Compared to the generic import() in rtracklayer, the explicit format-based naming in plyranges clarifies intent and the type of data being returned. Similarly, every plyranges function that computes with strand information indicates its intentions by including suffixes such as directed, upstream or downstream in its name, otherwise strand is ignored. The GenomicRanges API does not make this distinction explicit in its function naming, instead relying on a parameter that defaults to strand sensitivity, an arguably confusing behavior.

The implementation of plyranges is built on top of Bioconductor infrastructure, meaning most functions are constructed by composing generic functions from core Bioconductor packages. As a result, any Bioconductor packages that uses data structures that inherit from GRanges will be able to use plyranges for free. Another consequence of building on top of Bioconductor generics is that the speed and memory usage of plyranges functions are similar to the highly optimized methods implemented in Bioconductor for GRanges objects.

A caveat to constructing a compatible interface with dplyr is that plyranges makes extensive use of non-standard evaluation in R via the rlang package (Henry and Wickham 2017). Simply, this means that computations are evaluated in the context of the GRanges objects. Both dplyr and plyranges are based on the rlang language, because it allows for more expressive code that is free of repeated references to the container. Implicitly referencing the container is particularly convenient when programming interactively. Consequently, when programming with plyranges, a user needs to generally understand the rlang language and how to adapt their code accordingly. Users familiar with the tidyverse should already have such knowledge.