Gabe - very cool! I'll be following this with interest.

Ryan - conceptually I haven't been thinking of fixed-width ranges as different from general ranges, which is why I think it'd be neat if the user just got the benefits of a space-efficient representation without having to know/care about the underlying representation. My delineation by class is more a matter of prototyping/conceptual convenience, and reflects my thinking of IRanges/FWRanges as concrete implementations of the (virtual) Ranges class (albeit with FWRanges subject to additional constraints).
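A minimal sketch of the "user doesn't need to care" idea, written in Python rather than R purely for illustration. Every name here (`Ranges`, `GeneralRanges`, `FixedWidthRanges`, `make_ranges`) is a hypothetical stand-in, not the IRanges API and not either of the prototypes; the only behaviour borrowed from IRanges is the closed-interval convention end = start + width - 1.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence


class Ranges(ABC):
    """Stand-in for the virtual Ranges class: defines the user-facing API."""

    def __init__(self, start: Sequence[int]) -> None:
        self._start = list(start)

    def start(self) -> List[int]:
        return list(self._start)

    @abstractmethod
    def width(self) -> List[int]:
        """Return widths parallel to start, whatever is stored internally."""

    def end(self) -> List[int]:
        # Closed intervals: end = start + width - 1.
        return [s + w - 1 for s, w in zip(self._start, self.width())]


class GeneralRanges(Ranges):
    """Stand-in for IRanges: one stored width per range."""

    def __init__(self, start: Sequence[int], width: Sequence[int]) -> None:
        super().__init__(start)
        if len(width) != len(self._start):
            raise ValueError("width must be parallel to start")
        self._width = list(width)

    def width(self) -> List[int]:
        return list(self._width)


class FixedWidthRanges(Ranges):
    """Stand-in for FWRanges: a single stored width shared by all ranges."""

    def __init__(self, start: Sequence[int], width: int) -> None:
        super().__init__(start)
        self._width = int(width)

    def width(self) -> List[int]:
        # Expand on demand so callers still see a parallel vector.
        return [self._width] * len(self._start)


def make_ranges(start: Sequence[int], width: Sequence[int]) -> Ranges:
    """Hypothetical constructor: pick the compact form when widths are constant."""
    start, width = list(start), list(width)
    if len(width) != len(start):
        raise ValueError("width must be parallel to start")
    if width and len(set(width)) == 1:
        return FixedWidthRanges(start, width[0])
    return GeneralRanges(start, width)


snps = make_ranges([10, 20, 30], [1, 1, 1])   # stored with a single width
genes = make_ranges([10, 20], [5, 7])         # stored with one width per range
```

Both objects answer `start()`, `width()` and `end()` identically from the caller's point of view; only the storage differs, which is the property I'd want from either design.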
Cheers,
Pete

On Thu, 24 Nov 2016 at 14:14 Ryan <r...@thompsonclan.org> wrote:
>
> Hi all,
>
> In addition to the technical concerns, I suppose we should consider
> whether fixed-width ranges are conceptually different enough from
> general ranges to warrant a separate class, or whether this is just
> being considered for purely technical reasons. My feeling is that
> fixed-width ranges aren't sufficiently different from general ranges to
> justify a separate class. The two main uses I can think of for
> fixed-width ranges are genomic positions (i.e. length 1 ranges) and
> cases like "1kb upstream of" or "1kb radius around" a set of specified
> positions. But even for that case, fixed-width ranges are not
> necessarily usable because a position less than 1kb from the end of a
> chromosome would require a truncated range. (What behavior would we
> expect from a hypothetical FWRanges class in this case?)
>
> -Ryan
>
> On 11/23/16 8:01 PM, Ryan wrote:
> > Is it possible to allow the width slot of IRanges to be either a
> > normal vector or an Rle?
> >
> >
> > On 11/23/16 6:18 PM, Peter Hickey wrote:
> >> I've been toying with the idea of a fixed/constant width Ranges
> >> subclass. The motivation comes from storing DNA methylation data at CH
> >> loci (non-CpG methylation): there are 1.1 billion CH loci in the human
> >> genome, so to store these as a GRanges object requires two integer
> >> vectors of length 1.1 billion each, one for the @start and one for the
> >> @width slots of the IRanges object in the @ranges slot. But in this
> >> case, and perhaps others, such as storing SNP data, we have a situation
> >> where all loci have the same width, namely 1. Of course, you might
> >> argue such a 2-fold reduction in size is purely academic, but I think
> >> it could be a nice efficiency that's worth pursuing.
> >>
> >> I've sketched out two different prototypes, neither of which I've
> >> worked up to a complete implementation; I'd like to get some feedback
> >> on these two designs, along with a variation that I've not yet even
> >> tried implementing, before I decide how/whether to proceed.
> >>
> >> The two approaches are:
> >>
> >> 1. A new Ranges subclass, FWRanges (fixed-width Ranges, open to better
> >>    name suggestions).
> >>    a. The @width slot would be an integer vector of length 1
> >>    b. [variation not yet implemented] The @width slot would be an Rle
> >>       vector parallel to @start
> >> 2. Modifying the IRanges class. The @width slot may be an integer
> >>    vector of length 1 or a vector parallel to @start
> >>
> >> [Upon reflection, I suppose there could be a '2b' where the @width
> >> slot is an Rle, but I'm going to ignore this for now since in general
> >> it would be inefficient when the ranges have (random) variable widths]
> >>
> >> # Pros of 1
> >>
> >> - It seems the proper thing is to create a new Ranges subclass
> >> - No dangers associated with stuffing around with internals of the
> >>   IRanges class and clean code separation
> >>
> >> # Pros of 1b compared to 1a
> >>
> >> - Like for IRanges, the @width slot would remain parallel to the
> >>   @start slot
> >>
> >> # Cons of 1
> >>
> >> - Can't immediately use in a GRanges object because the @ranges slot
> >>   is classed as an IRanges object
> >>   - Perhaps this could be changed to allow a Ranges object in the
> >>     @ranges slot of a GRanges object?
> >>   - Otherwise, would also need to implement a subclass of GenomicRanges
> >>     (say, FWGRanges) that used a FWRanges object in the @ranges slot. This
> >>     would necessitate a fair bit of code duplicated from GRanges methods.
> >> - Methods like start<-, end<-, width<- would either have to
> >>   - (A) return an error if the new object no longer has fixed/constant
> >>     widths
> >>   - (B) coerce it to an IRanges object (with or without warning) thus
> >>     meaning these operations would not be strict endomorphisms
> >> - Users would only get the space-savings of the FWRanges class if they
> >>   explicitly construct a FWRanges object or coerce a compatible IRanges
> >>   object to an FWRanges object
> >> - Clean code separation from the IRanges class may also lead to
> >>   duplicated code
> >>
> >> # Cons of 1b compared to 1a
> >>
> >> - Endomorphic versions of methods like start<-, end<-, width<- could
> >>   create a @width slot that is twice the 'necessary' size (e.g., an Rle
> >>   representation of a vector that contains no 'runs').
> >>
> >> # Pros of 2
> >>
> >> - If properly implemented, the user wouldn't need to think about
> >>   whether the ranges were fixed or variable width, they'd just get the
> >>   most efficient representation
> >>
> >> # Cons of 2
> >>
> >> - This is fairly obvious, 2 would be a major (internal) change to a
> >>   core Bioconductor class
> >> - The @width slot would no longer necessarily be parallel to the @start
> >>   slot, e.g., code that does direct slot access via @width could easily
> >>   break (of course, the width() getter would be modified to return a
> >>   vector parallel to the @start slot, but people (*cough* me) have code
> >>   that does the wrong thing with respect to the use of getters vs.
> >>   direct slot access)
> >> - New IRanges objects may be incompatible with earlier versions of
> >>   IRanges
> >>
> >> Your feedback is much appreciated,
> >> Pete
> >>
> >> _______________________________________________
> >> Bioc-devel@r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >
>
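To put rough numbers on two points in the quoted proposal (the 2-fold saving, and the "twice the 'necessary' size" worry under 1b), here is a small back-of-the-envelope sketch in Python. The `rle` and `stored_ints` helpers are a toy run-length encoding written for this example, not the Bioconductor Rle implementation; the only external fact assumed is that R stores integers as 4-byte values.

```python
from itertools import groupby

BYTES_PER_INT = 4  # assumption: 32-bit integers, as in R integer vectors


def rle(values):
    """Toy run-length encoding: parallel lists of run values and run lengths."""
    run_values, run_lengths = [], []
    for value, run in groupby(values):
        run_values.append(value)
        run_lengths.append(sum(1 for _ in run))
    return run_values, run_lengths


def stored_ints(values):
    """Number of integers the toy encoding has to store for `values`."""
    run_values, run_lengths = rle(values)
    return len(run_values) + len(run_lengths)


constant_widths = [1] * 1000      # the fixed-width case: a single run
no_run_widths = [1, 2] * 500      # worst case: no two neighbours are equal

print(stored_ints(constant_widths))  # 2
print(stored_ints(no_run_widths))    # 2000, twice the 1000 plain integers

# Size of one plain integer vector with one entry per CH locus.
n_loci = 1_100_000_000
gb_per_vector = n_loci * BYTES_PER_INT / 1e9
print(gb_per_vector)  # about 4.4 (GB); @start plus @width is about 8.8
```

Under these assumptions, dropping the parallel @width vector saves on the order of 4.4 GB for the CH-loci case, while a run-length-encoded width with no runs costs about double what a plain vector would.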