2.3 Discussion
The design of plyranges adheres to well understood principles of language and API design: cognitive consistency, cohesion, endomorphism and expressiveness (Green and Petre 1996). To varying degrees, these principles also underlie the design of dplyr and the Bioconductor infrastructure.
We have aimed for plyranges to have a simple and direct mapping to the user’s cognitive model, i.e., how the user thinks about the data. This requires careful selection of the level of abstraction so that the user can express workflows in the language of genomics. This motivates the adoption of the tidy GRanges object as our central data structure. The basic data.frame and dplyr tibble lack any notion of genomic ranges and so could not easily support our genomic grammar, with its specific verbs for range-oriented data manipulation. Another example of cognitive consistency is how plyranges is insensitive to direction/strand by default when, e.g., detecting overlaps. GenomicRanges has the opposite behavior. We believe that defaulting to purely spatial overlap is most intuitive to most users.
To further enable cognitive consistency, plyranges functions are cohesive. A function is defined to be cohesive if it performs a singular task without producing any side effects. Singular tasks can always be broken down further at lower levels of abstraction. For example, to resize a range, the user needs to specify which position (start, end, midpoint) should be invariant over the transformation. The resize() function from the GenomicRanges package has a fix argument that sets the anchor, so calling resize() coalesces anchoring and width modification. The coupling at the function call level is justified since the effect of setting the width depends on the anchor. However, plyranges increases cohesion and decouples the anchoring into its own function call.
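The contrast between the coupled and decoupled styles can be sketched as follows. This is a minimal illustration rather than a prescribed workflow: the toy ranges are invented for the example, and it assumes that GenomicRanges and plyranges are installed and that plyranges makes the %>% pipe available.

```r
library(GenomicRanges)
library(plyranges)

# Two illustrative unstranded ranges of width 50 (made-up coordinates).
gr <- GRanges(seqnames = "chr1",
              ranges = IRanges(start = c(100, 200), width = 50))

# GenomicRanges: the anchor (fix) and the new width are set in one call.
coupled <- resize(gr, width = 10, fix = "start")

# plyranges: anchoring is its own step, then the width is modified.
anchored_start <- gr %>% anchor_start() %>% mutate(width = 10)
anchored_end   <- gr %>% anchor_end()   %>% mutate(width = 10)
```

In the decoupled form, the name of the anchoring function states which coordinate stays fixed, so mutate() only has to express the new width.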
Increasing cohesion simplifies the interface to each operation, makes the meaning of arguments more intuitive, and relies on function names as the primary means of expression, instead of a more complex mixture of function and argument names. This results in the user being able to conceptualize the plyranges DSL as a flat catalog of functions, without having to descend further into documentation to understand a function’s arguments. A flat function catalog also enhances API discoverability, particularly through auto-completion in integrated development environments (IDEs). One downside of pushing cohesion to this extreme is that function calls become coupled, and care is necessary to treat them as a group when modifying code.
Like dplyr, plyranges verbs are functional: they are free of side effects and are generally endomorphic, meaning that when the input is a GRanges object they return a GRanges object. This enables chaining of verbs through syntax like the forward pipe operator from the magrittr package. This syntax has a direct cognitive mapping to natural language and the intuitive notion of pipelines. The low-level object-oriented APIs of Bioconductor tend to manipulate data via sub-replacement functions, like start(gr) <- x. These ultimately produce the side effect of replacing a symbol mapping in the current environment and thus are not amenable to so-called fluent syntax.
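As a small sketch of this difference, the same two-step manipulation is written below first with replacement functions and then as a pipeline. The ranges and the score column are invented for illustration, and the example assumes GenomicRanges and plyranges are installed.

```r
library(GenomicRanges)
library(plyranges)

# Illustrative ranges with a made-up score column.
gr <- GRanges(seqnames = c("chr1", "chr1", "chr2"),
              ranges = IRanges(start = c(1, 20, 5), width = 10),
              score = c(0.5, 2, 3))

# Replacement-function style: an intermediate symbol is created and
# then rebound by start<-, so the steps cannot be chained.
tmp <- gr[gr$score > 1]
start(tmp) <- start(tmp) + 5L

# Pipeline style: each verb takes a GRanges and returns a GRanges,
# and the input object gr is left untouched.
piped <- gr %>%
  filter(score > 1) %>%
  mutate(start = start + 5L)
```

Both forms shift the start of the retained ranges while keeping their ends fixed; only the second reads as a single left-to-right expression.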
Expressiveness relates to the information content in code: the programmer should be able to clarify intent without unnecessary verbosity. For example, our overlap-based join operations are more concise than the multiple steps necessary to achieve the same effect in the original GenomicRanges API. In other cases, the plyranges API increases verbosity for the sake of clarity and cohesion. Explicitly calling anchor() can require more typing, but the code is easier to comprehend. Another example is the set of routines for importing genomic annotations, including read_gff(), read_bed(), and read_bam(). Compared to the generic import() in rtracklayer, the explicit format-based naming in plyranges clarifies intent and the type of data being returned. Similarly, every plyranges function that computes with strand information indicates its intentions by including suffixes such as directed, upstream or downstream in its name, otherwise strand is ignored. The GenomicRanges API does not make this distinction explicit in its function naming, instead relying on a parameter that defaults to strand sensitivity, an arguably confusing behavior.
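The following sketch illustrates both points on toy data: an overlap join written as a single verb, its strand-aware counterpart selected by the directed suffix, and one possible multi-step equivalent using findOverlaps(). The ranges and the id and gene columns are invented for the example, and the manual version is only one of several ways to write it with GenomicRanges.

```r
library(GenomicRanges)
library(plyranges)

# Illustrative query and subject ranges (made-up coordinates and labels).
query <- GRanges(seqnames = "chr1",
                 ranges = IRanges(start = c(10, 50), width = 10),
                 strand = c("+", "-"),
                 id = c("a", "b"))
subject <- GRanges(seqnames = "chr1",
                   ranges = IRanges(start = c(15, 55), width = 10),
                   strand = c("-", "-"),
                   gene = c("g1", "g2"))

# plyranges: strand is ignored unless the function name says otherwise.
joined          <- join_overlap_inner(query, subject)
joined_directed <- join_overlap_inner_directed(query, subject)

# GenomicRanges: strand handling is an argument, and the join is assembled
# by hand from the hits object.
hits <- findOverlaps(query, subject, ignore.strand = TRUE)
manual <- query[queryHits(hits)]
manual$gene <- subject$gene[subjectHits(hits)]
```

Here the undirected join pairs both query ranges with a subject range, whereas the directed join keeps only the pair that lies on the same strand.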
The implementation of plyranges is built on top of Bioconductor infrastructure, meaning most functions are constructed by composing generic functions from core Bioconductor packages. As a result, any Bioconductor package that uses data structures that inherit from GRanges will be able to use plyranges for free. Another consequence of building on top of Bioconductor generics is that the speed and memory usage of plyranges functions are similar to the highly optimized methods implemented in Bioconductor for GRanges objects.
A caveat to constructing a compatible interface with dplyr is that plyranges makes extensive use of non-standard evaluation in R via the rlang package (Henry and Wickham 2017). Put simply, this means that computations are evaluated in the context of the GRanges objects. Both dplyr and plyranges are based on the rlang language, because it allows for more expressive code that is free of repeated references to the container. Implicitly referencing the container is particularly convenient when programming interactively. Consequently, when programming with plyranges, a user generally needs to understand the rlang language and how to adapt their code accordingly. Users familiar with the tidyverse should already have such knowledge.
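A brief sketch of what this means in practice is given below. The interactive call refers to score and width directly, while the wrapper function has to unquote its arguments with rlang. The function filter_above() and the toy data are hypothetical and introduced only for illustration; the example assumes that plyranges verbs accept rlang's !! unquoting operator in the same way as dplyr verbs.

```r
library(GenomicRanges)
library(plyranges)

# Illustrative ranges with a made-up score column.
gr <- GRanges(seqnames = "chr1",
              ranges = IRanges(start = c(1, 100, 200), width = 20),
              score = c(1, 5, 10))

# Interactive use: score and width are looked up inside gr,
# so there is no need to write gr$score or width(gr).
interactive_result <- gr %>% filter(score > 2, width == 20)

# Programmatic use (hypothetical helper): the column name arrives as a
# string, so it is converted to a symbol and unquoted into the expression.
filter_above <- function(x, column, cutoff) {
  col <- rlang::sym(column)
  x %>% filter((!!col) > !!cutoff)
}

programmatic_result <- filter_above(gr, "score", 2)
```

The two calls select the same ranges; the second shows the additional step needed once the column is no longer typed literally at the call site.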