library(GenomicRanges)
class?GPos

On Wed, Nov 23, 2016 at 6:18 PM, Peter Hickey <peter.hic...@gmail.com> wrote:
> I've been toying with the idea of a fixed/constant width Ranges
> subclass. The motivation comes from storing DNA methylation data at CH
> loci (non-CpG methylation): there are 1.1 billion CH loci in the human
> genome, so storing these as a GRanges object requires two integer
> vectors of 1.1 billion elements each, one for the @start and one for
> the @width slot of the IRanges object in the @ranges slot. But in this
> case, and perhaps others, such as storing SNP data, we have a situation
> where all loci have the same width, namely 1. Of course, you might
> argue such a 2-fold reduction in size is purely academic, but I think
> it could be a nice efficiency that's worth pursuing.
>
> I've sketched out two different prototypes, neither of which I've
> worked up to a complete implementation; I'd like to get some feedback
> on these two designs, along with a variation that I've not yet even
> tried implementing, before I decide how/whether to proceed.
>
> The two approaches are:
>
> 1. A new Ranges subclass, FWRanges (fixed-width Ranges, open to better
>    name suggestions).
>    a. The @width slot would be an integer vector of length 1
>    b. [variation not yet implemented] The @width slot would be an Rle
>       vector parallel to @start
> 2. Modifying the IRanges class. The @width slot may be an integer
>    vector of length 1 or a vector parallel to @start
>
> [Upon reflection, I suppose there could be a '2b' where the @width
> slot is an Rle, but I'm going to ignore this for now since in general
> it would be inefficient when the ranges have (random) variable widths]
>
> # Pros of 1
>
> - It seems the proper thing is to create a new Ranges subclass
> - No dangers associated with stuffing around with internals of the
>   IRanges class and clean code separation
>
> # Pros of 1b compared to 1a
>
> - Like for IRanges, the @width slot would remain parallel to the @start
>   slot
>
> # Cons of 1
>
> - Can't immediately use in a GRanges object because the @ranges slot
>   is classed as an IRanges object
>   - Perhaps this could be changed to allow a Ranges object in the
>     @ranges slot of a GRanges object?
>   - Otherwise, would also need to implement a subclass of GenomicRanges
>     (say, FWGRanges) that used an FWRanges object in the @ranges slot.
>     This would necessitate a fair bit of code duplicated from GRanges
>     methods.
> - Methods like start<-, end<-, width<- would either have to
>   - (A) return an error if the new object no longer has fixed/constant
>     widths
>   - (B) coerce it to an IRanges object (with or without warning), thus
>     meaning these operations would not be strict endomorphisms
> - Users would only get the space-savings of the FWRanges class if they
>   explicitly construct an FWRanges object or coerce a compatible IRanges
>   object to an FWRanges object
> - Clean code separation from the IRanges class may also lead to
>   duplicated code
>
> # Cons of 1b compared to 1a
>
> - Endomorphic versions of methods like start<-, end<-, width<- could
>   create a @width slot that is twice the 'necessary' size (e.g., an Rle
>   representation of a vector that contains no 'runs').
>
> # Pros of 2
>
> - If properly implemented, the user wouldn't need to think about
>   whether the ranges were fixed or variable width, they'd just get the
>   most efficient representation
>
> # Cons of 2
>
> - This is fairly obvious: 2 would be a major (internal) change to a
>   core Bioconductor class
> - The @width slot would no longer necessarily be parallel to the @start
>   slot, e.g., code that does direct slot access via @width could easily
>   break (of course, the width() getter would be modified to return a
>   vector parallel to the @start slot, but people (*cough* me) have code
>   that does the wrong thing with respect to the use of getters vs.
>   direct slot access)
> - New IRanges objects may be incompatible with earlier versions of
>   IRanges
>
> Your feedback is much appreciated,
> Pete
>
> _______________________________________________
> Bioc-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
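
For readers who want to see the trade-off in design 1a concretely, here is a minimal base-R sketch. It is not the FWRanges prototype from the quoted message and not how GPos is implemented; the class name FWRangesSketch and the helper fw_width() are hypothetical names made up for this illustration. It stores a single width next to the start vector, expands it on demand so callers still see a vector parallel to the starts, and compares the storage of the two layouts on a small example. At the scale discussed above, each 1.1 billion element integer vector would take roughly 4.4 GB, since R integers are 4 bytes.

```r
library(methods)

# Hypothetical class for illustration only; not part of IRanges or
# GenomicRanges. The width slot must hold exactly one integer (design 1a).
setClass(
  "FWRangesSketch",
  representation(start = "integer", width = "integer"),
  validity = function(object) {
    if (length(object@width) != 1L) "@width must have length 1" else TRUE
  }
)

# Hypothetical getter: expands the single stored width so the result is
# parallel to @start, as a width() getter would need to do.
fw_width <- function(x) rep.int(x@width, length(x@start))

n <- 100000L
starts <- cumsum(rep.int(3L, n))  # arbitrary start positions
widths <- rep.int(1L, n)          # the parallel width vector design 1a avoids

fixed <- new("FWRangesSketch", start = starts, width = 1L)

# Storage of the two layouts: start + parallel width vs. start + one width.
parallel_bytes <- as.numeric(object.size(starts)) + as.numeric(object.size(widths))
fixed_bytes    <- as.numeric(object.size(starts)) + as.numeric(object.size(1L))

print(c(parallel = parallel_bytes, fixed_width = fixed_bytes))
```

Code that reads the slot directly would see a length-1 vector here, which is the same breakage risk listed under "Cons of 2"; going through a getter such as fw_width() avoids it.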