Yes, we could use this offset-based approach. Though we would want to just use the seqlengths for calculating the offsets, so that every query will be compatible with the index. This would usually mean doubles, instead of integers. The hash approach might be a tiny bit faster, as we wouldn't need to transform, and would quickly get to a smaller tree, although this is not a huge benefit given the logarithmic time complexity of queries.
Michael On Wed, Apr 3, 2013 at 11:12 AM, Martin Morgan <mtmor...@fhcrc.org> wrote: > On 04/03/2013 10:28 AM, Kasper Daniel Hansen wrote: > >> Making IntervalTree chromosome would also be a great addition for >> organisms >> with many sequences, like bee (due to an incomplete genome; 10,000s of >> sequences). It does not matter for humans, but findOverlaps is >> excruciatingly slow for bee's. I have a couple of posts on this in the >> archive. >> >> I am predicting this will be the case in the future for most non-model >> organisms; finishing a genome is expensive and time consuming. >> > > A cheap (both computationally and philosophically) hack is along the lines > of > > gr <- GRanges(c("A", "B", "A"), IRanges(c(2000, 1000, 3000), width=100)) > > rng <- range(gr) > off0 <- (width(rng)[-length(rng)] + 1L) > offset <- setNames(c(1L, cumsum(off0)) - start(rng), seqnames(rng)) > shift(ranges(gr), offset[as.character(seqnames(**gr))]) > > which shifts the ranges to be non-overlapping by seqname, so findOverlaps > can operate on the IRanges directly. This is a trick I used in > selectMethod(disjoin, "CompressedIRangesList"). There's the risk of integer > overflow (when sum(width(rng) + 1L) > .Machine$integer.max) but this could > be handled (if it occurs!) by partitioning into .Machine$integer.max-sized > ranges. Unless IRanges were updated to support longer integers... > > Martin > > > >> Kasper >> >> >> On Wed, Apr 3, 2013 at 12:45 PM, Michael Lawrence < >> lawrence.mich...@gene.com >> >>> wrote: >>> >> >> Some ideas: >>> >>> - Turn the IntervalTree into a list/array of nodes that can be >>> subset/reordered with shallow copying (just copy the pointers to the >>> nodes), and the index would be secondary. The index in the array could be >>> stored in each node, for lookup during overlap queries. Right now, as far >>> as I can tell, GIntervalTree will get confused if the user reorders e.g. >>> via [. >>> >>> - Make IntervalTree aware of the sequence/chromosome, e.g., have a hash >>> of >>> trees, which is trivial since seqnames is already a factor. >>> >>> Michael >>> >>> >>> >>> On Wed, Apr 3, 2013 at 9:29 AM, Hector Corrada Bravo <hcorr...@gmail.com >>> >>>> wrote: >>>> >>> >>> Yep, I didn't comment on that, but I agree that abstracting how >>>> GRanges stores ranges would make this more elegant. Right now >>>> ranges(GRanges) is specified to be of IRanges class instead of the >>>> abstract Ranges class. >>>> >>>> If it were the latter then GIntervalTree can be a subclass of >>>> GenomicRanges, in a similar way that IntervalTree is a subclass of >>>> Ranges. >>>> >>>> On Wed, Apr 3, 2013 at 12:23 PM, Michael Lawrence >>>> <lawrence.mich...@gene.com> wrote: >>>> >>>>> Hi Hector, >>>>> >>>>> That's interesting, thanks for passing this along. I'm still wishing >>>>> >>>> that >>> >>>> somehow GRanges itself could abstract the way it stores ranges. I know >>>>> >>>> that >>>> >>>>> Herve/Patrick had some reasons for depending specifically on GRanges. >>>>> >>>> One >>> >>>> reason was probably convenience at the C level, but it wouldn't be hard >>>>> >>>> to >>>> >>>>> create a Ranges abstraction at the C level, as well. >>>>> >>>>> Michael >>>>> >>>>> >>>>> >>>>> On Tue, Apr 2, 2013 at 5:40 PM, Hector Corrada Bravo < >>>>> >>>> hcorr...@gmail.com >>> >>>> >>>>> wrote: >>>>> >>>>>> >>>>>> Hello bioc-develers, >>>>>> >>>>>> I'm writing an application where lots findOverlap calls are made on >>>>>> static GRanges objects. For IRanges we can create persistent >>>>>> IntervalTree objects that would serve the multiple overlap query >>>>>> use-case. There is no equivalent for GenomicRanges objects, so I'm >>>>>> proposing an implementation for this. >>>>>> >>>>>> Please check >>>>>> http://github.com/hcorrada/**GenomicIntervalTree<http://github.com/hcorrada/GenomicIntervalTree> >>>>>> >>>>>> There's a first cut implementation there you can test by installing >>>>>> this skeleton package. E.g, >>>>>> >>>>>> library(devtools) >>>>>>> install_github("**GenomicIntervalTree", username="hcorrada", >>>>>>> >>>>>> subdir="pkg") >>>> >>>>> library(GenomicIntervalTree) >>>>>>> >>>>>> >>>>>> Let me know what you think. >>>>>> >>>>>> Cheers, >>>>>> Hector >>>>>> >>>>>> ______________________________**_________________ >>>>>> Bioc-devel@r-project.org mailing list >>>>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel> >>>>>> >>>>> >>>>> >>>>> >>>> >>> [[alternative HTML version deleted]] >>> >>> ______________________________**_________________ >>> Bioc-devel@r-project.org mailing list >>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel> >>> >>> >> [[alternative HTML version deleted]] >> >> ______________________________**_________________ >> Bioc-devel@r-project.org mailing list >> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel> >> >> > > -- > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793 > [[alternative HTML version deleted]] _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel