I've been toying with the idea of a fixed/constant width Ranges subclass. The motivation comes from storing DNA methylation data at CH loci (non-CpG methylation): there are 1.1 billion CH loci in the human genome, so storing these as a GRanges object requires two integer vectors of length 1.1 billion each, one for the @start slot and one for the @width slot of the IRanges object in the @ranges slot. But in this case, and perhaps in others such as storing SNP data, all loci have the same width, namely 1, so the @width vector carries no information beyond that single value. Of course, you might argue that a 2-fold reduction in size is purely academic, but I think it could be a nice efficiency that's worth pursuing.
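As a rough back-of-envelope check of what is at stake, here is a small sketch in base R. It assumes R's usual 4 bytes per integer and uses plain lists as stand-ins for the two layouts; these are not real IRanges objects, and the variable names are made up for illustration.

```r
# Back-of-envelope estimate, assuming 4 bytes per R integer.
n_loci <- 1.1e9          # number of CH loci quoted above
bytes_per_int <- 4

# IRanges-style layout: one start and one width per locus
two_vectors_gb <- 2 * n_loci * bytes_per_int / 1e9

# Fixed-width layout: one start per locus plus a single shared width
one_vector_gb <- (n_loci + 1) * bytes_per_int / 1e9

# Small-scale check with real vectors (1e6 loci instead of 1.1e9).
# Plain lists stand in for the two layouts; they are NOT IRanges objects.
n <- 1e6L
start <- cumsum(rep(2L, n))                        # arbitrary integer positions
parallel <- list(start = start, width = rep(1L, n))
fixed    <- list(start = start, width = 1L)

ratio <- as.numeric(object.size(parallel)) / as.numeric(object.size(fixed))
print(c(two_vectors_gb = two_vectors_gb, one_vector_gb = one_vector_gb))
print(ratio)
```

Under those assumptions the parallel layout needs about 8.8 GB for the two vectors and the fixed-width layout about 4.4 GB, which is where the 2-fold figure comes from.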
I've sketched out two different prototypes, neither of which I've worked up to a complete implementation. I'd like to get some feedback on these two designs, along with a variation that I've not yet tried implementing, before I decide how/whether to proceed.

The two approaches are:

1. A new Ranges subclass, FWRanges (fixed-width Ranges, open to better name suggestions).
   a. The @width slot would be an integer vector of length 1
   b. [variation not yet implemented] The @width slot would be an Rle vector parallel to @start
2. Modifying the IRanges class. The @width slot may be an integer vector of length 1 or a vector parallel to @start

[Upon reflection, I suppose there could be a '2b' where the @width slot is an Rle, but I'm going to ignore this for now since in general it would be inefficient when the ranges have (random) variable widths]

# Pros of 1
- It seems the proper thing is to create a new Ranges subclass
- No dangers associated with stuffing around with the internals of the IRanges class, plus clean code separation

# Pros of 1b compared to 1a
- As for IRanges, the @width slot would remain parallel to the @start slot

# Cons of 1
- Can't immediately use in a GRanges object because the @ranges slot is classed as an IRanges object
  - Perhaps this could be changed to allow a Ranges object in the @ranges slot of a GRanges object?
  - Otherwise, I would also need to implement a subclass of GenomicRanges (say, FWGRanges) that uses a FWRanges object in the @ranges slot. This would necessitate a fair bit of code duplicated from GRanges methods.
- Methods like start<-, end<-, width<- would have to either
  - (A) return an error if the new object no longer has fixed/constant widths, or
  - (B) coerce it to an IRanges object (with or without a warning), meaning these operations would not be strict endomorphisms
- Users would only get the space savings of the FWRanges class if they explicitly construct a FWRanges object or coerce a compatible IRanges object to a FWRanges object
- Clean code separation from the IRanges class may also lead to duplicated code

# Cons of 1b compared to 1a
- Endomorphic versions of methods like start<-, end<-, width<- could create a @width slot that is twice the 'necessary' size (e.g., an Rle representation of a vector that contains no 'runs').

# Pros of 2
- If properly implemented, the user wouldn't need to think about whether the ranges were fixed or variable width; they'd just get the most efficient representation

# Cons of 2
- Fairly obviously, 2 would be a major (internal) change to a core Bioconductor class
- The @width slot would no longer necessarily be parallel to the @start slot, so code that does direct slot access via @width could easily break (of course, the width() getter would be modified to return a vector parallel to the @start slot, but people (*cough* me) have code that does the wrong thing with respect to the use of getters vs. direct slot access)
- New IRanges objects may be incompatible with earlier versions of IRanges

Your feedback is much appreciated,
Pete

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel