pace Wolfgang Huber ... Peter, I don't mean to be rude. Your comments deserve more study, but it was fun to remember GPos, which I had forgotten.
On Wed, Nov 23, 2016 at 6:34 PM, Vincent Carey <st...@channing.harvard.edu> wrote:

> library(GenomicRanges)
> class?GPos
>
> On Wed, Nov 23, 2016 at 6:18 PM, Peter Hickey <peter.hic...@gmail.com>
> wrote:
>
>> I've been toying with the idea of a fixed/constant width Ranges
>> subclass. The motivation comes from storing DNA methylation data at CH
>> loci (non-CpG methylation): there are 1.1 billion CH loci in the human
>> genome, so storing these as a GRanges object requires two integer
>> vectors of length 1.1 billion, one for the @start and one for the
>> @width slot of the IRanges object in the @ranges slot. But in this
>> case, and perhaps others, such as storing SNP data, we have a
>> situation where all loci have the same width, namely 1. Of course, you
>> might argue such a 2-fold reduction in size is purely academic, but I
>> think it could be a nice efficiency that's worth pursuing.
>>
>> I've sketched out two different prototypes, neither of which I've
>> worked up to a complete implementation; I'd like to get some feedback
>> on these two designs, along with a variation that I've not yet even
>> tried implementing, before I decide how/whether to proceed.
>>
>> The two approaches are:
>>
>> 1. A new Ranges subclass, FWRanges (fixed-width Ranges, open to better
>> name suggestions).
>>   a. The @width slot would be an integer vector of length 1
>>   b. [variation not yet implemented] The @width slot would be an Rle
>> vector parallel to @start
>> 2. Modifying the IRanges class. The @width slot may be an integer
>> vector of length 1 or a vector parallel to @start
>>
>> [Upon reflection, I suppose there could be a '2b' where the @width
>> slot is an Rle, but I'm going to ignore this for now since in general
>> it would be inefficient when the ranges have (random) variable widths]
>>
>> # Pros of 1
>>
>> - It seems the proper thing is to create a new Ranges subclass
>> - No dangers associated with stuffing around with internals of the
>> IRanges class and clean code separation
>>
>> # Pros of 1b compared to 1a
>>
>> - Like for IRanges, the @width slot would remain parallel to the @start
>> slot
>>
>> # Cons of 1
>>
>> - Can't immediately use in a GRanges object because the @ranges slot
>> is classed as an IRanges object
>>   - Perhaps this could be changed to allow a Ranges object in the
>> @ranges slot of a GRanges object?
>>   - Otherwise, would also need to implement a subclass of GenomicRanges
>> (say, FWGRanges) that used a FWRanges object in the @ranges slot. This
>> would necessitate a fair bit of code duplicated from GRanges methods.
>> - Methods like start<-, end<-, width<- would either have to
>>   - (A) return an error if the new object no longer has fixed/constant
>> widths
>>   - (B) coerce it to an IRanges object (with or without warning), thus
>> meaning these operations would not be strict endomorphisms
>> - Users would only get the space savings of the FWRanges class if they
>> explicitly construct a FWRanges object or coerce a compatible IRanges
>> object to an FWRanges object
>> - Clean code separation from the IRanges class may also lead to
>> duplicated code
>>
>> # Cons of 1b compared to 1a
>>
>> - Endomorphic versions of methods like start<-, end<-, width<- could
>> create a @width slot that is twice the 'necessary' size (e.g., an Rle
>> representation of a vector that contains no 'runs').
>>
>> # Pros of 2
>>
>> - If properly implemented, the user wouldn't need to think about
>> whether the ranges were fixed or variable width; they'd just get the
>> most efficient representation
>>
>> # Cons of 2
>>
>> - This is fairly obvious: 2 would be a major (internal) change to a
>> core Bioconductor class
>> - The @width slot would no longer necessarily be parallel to the @start
>> slot, e.g., code that does direct slot access via @width could easily
>> break (of course, the width() getter would be modified to return a
>> vector parallel to the @start slot, but people (*cough* me) have code
>> that does the wrong thing with respect to the use of getters vs.
>> direct slot access)
>> - New IRanges objects may be incompatible with earlier versions of IRanges
>>
>> Your feedback is much appreciated,
>> Pete
>>
>> _______________________________________________
>> Bioc-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
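
For anyone following along, here is a rough sketch of what design 1a from Pete's message might look like, written in base R S4 only so it runs without Bioconductor. Everything named here (the FWRanges class as defined below, fw_start(), fw_width(), fw_end(), set_fw_width()) is hypothetical and is not the IRanges or GenomicRanges API; the setter follows option (A), i.e. it errors rather than coercing when the widths would stop being constant.

```r
# Hypothetical sketch of design 1a (fixed-width ranges); not the IRanges API.
library(methods)

# @start is a full-length integer vector; @width is a single integer.
setClass("FWRanges",
  representation(start = "integer", width = "integer"),
  validity = function(object) {
    if (length(object@width) != 1L || is.na(object@width) || object@width < 0L)
      return("'width' must be a single non-negative integer")
    if (anyNA(object@start))
      return("'start' must not contain NAs")
    TRUE
  })

# Constructor (hypothetical name and defaults).
FWRanges <- function(start = integer(), width = 1L) {
  new("FWRanges", start = as.integer(start), width = as.integer(width))
}

# Getters always return vectors parallel to @start, so code that uses
# getters rather than direct slot access never sees the compact slot.
fw_start <- function(x) x@start
fw_width <- function(x) rep.int(x@width, length(x@start))
fw_end   <- function(x) x@start + x@width - 1L

# Option (A) from the thread: refuse replacements that break fixed width.
set_fw_width <- function(x, value) {
  value <- as.integer(value)
  if (length(value) != 1L)
    stop("FWRanges requires a single constant width")
  x@width <- value
  validObject(x)
  x
}

# Back-of-envelope saving for the CH-loci case: each length-1.1e9 integer
# vector costs 4 bytes per element, i.e. about 4.4 GB (decimal).
bytes_per_vector <- 1.1e9 * 4

# Small usage example.
loci <- FWRanges(start = c(5L, 10L, 20L), width = 1L)
fw_end(loci)
```

Under these assumptions the saving is one of the two roughly 4.4 GB vectors, which matches the 2-fold reduction described above. Option (B) would replace the stop() call with a coercion to an ordinary start/width representation.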