Thanks for everyones' input. @Hervé FWIW, the below benchmark suggests that unfortunately this is a fair bit slower than is.unsorted(order(gr)) when the length of the GRanges object is < 10,000,000 (the relative difference in speed is larger for small n in my brief tests; I didn't check above n > 10,000,000)
```r # GenomicRanges_1.23.1 library(GenomicRanges) # Simulate some random ranges sim_gr <- function(n) { GRanges(seqnames = sample(paste0("chr", 1:22), n, replace = TRUE), ranges = IRanges(sample(n * 10, size = n, replace = TRUE), width = runif(n, 1, 100000)), strand = sample(c("+", "-", "*"), n, replace = TRUE), seqinfo = Seqinfo(paste0("chr", 1:22))) } gr <- sim_gr(10000000) herve <- function(x, na.rm=FALSE, strictly=FALSE) { if (length(x) <= 1L) return(FALSE) x1 <- head(x, n=-1) x2 <- tail(x, n=-1) if (strictly) return(any(x1 >= x2)) any(x1 > x2) } # 22 seconds system.time(herve(gr)) # 11.3 seconds system.time(is.unsorted(order(gr))) # And when it's already sorted gr2 <- sort(gr) # 4.3 seconds system.time(herve(gr2)) # 0.2 seconds system.time(is.unsorted(order(gr2))) ``` Roughly, it looks like the head(), tail() calls take approximately 1/4 of the time each, while the any() call takes the remaining 1/2 of the time. I was thinking it might be possible to make this quite fast by looping over the GRanges object at the C-level and breaking out of the loop if gr[i+1] <= gr[i] or gr[i+1] < gr[i], as appropriate. Does this sound reasonable? Cheers, Pete On 3 November 2015 at 14:06, Michael Lawrence <lawrence.mich...@gene.com> wrote: > > > On Mon, Nov 2, 2015 at 6:39 PM, Hervé Pagès <hpa...@fredhutch.org> wrote: >> >> Hi, >> >> @Pete: >> >> 2a- I would just compare pairs of adjacent elements, taking >> advantage of the fact that <= is vectorized and cheap. So something >> like: >> >> setMethod("is.unsorted", "Vector", >> function(x, na.rm=FALSE, strictly=FALSE) >> { >> if (length(x) <= 1L) >> return(FALSE) >> x1 <- head(x, n=-1) >> x2 <- tail(x, n=-1) >> if (strictly) >> return(any(x1 >= x2)) >> any(x1 > x2) >> } >> ) >> >> Since this will work on any Vector derivative for which <= and >> subsetting are defined, it's a good candidate for being the default >> "is.unsorted" method for Vector objects. I'll add it to S4Vectors. >> >> 2b- The semantic of is.unsorted() on a GRangesList object or any >> List object in general should be sapply(x, is.unsorted), for >> consistency with order(), sort(), etc: >> >> > sort(IntegerList(4:3, 1:-2)) >> IntegerList of length 2 >> [[1]] 3 4 >> [[2]] -2 -1 0 1 >> >> I'll add this too. >> >> 2c - That won't be needed. The default method for Vector objects will >> work on RangedSummarizedExperiment objects (<= and 1D subsetting are >> defined and along the same dimension). >> >> @Gabe: >> >> See ?`GenomicRanges-comparison` for how the order of genomic ranges >> is defined. >> >> @Michael: >> >> Calling base::.gt() in a loop sounds indeed very inefficient. What >> about having base::is.unsorted() do the above on "objects" instead? >> base::.gt() seems to also require that the object is subsettable so >> the requirements are the same. >> >> Then we wouldn't need the default "is.unsorted" method for Vector >> objects, only a default "anyNA" method for Vector objects that always >> returns FALSE (plus some specific ones for Rle and other Vector >> derivatives that support NAs). >> > > Yea, I assumed it did what you suggested before looking at it. It would be > the right place to fix this. > >> >> Thanks, >> H. >> >> >> On 11/02/2015 05:35 PM, Michael Lawrence wrote: >>> >>> The notion of sortedness is already formally defined, which is why we >>> have >>> an order method, etc. >>> >>> The base is.unsorted implementation for "objects" ends up calling >>> base::.gt() for each adjacent pair of elements, which is likely too slow >>> to >>> be practical, so we probably should add a custom method. >>> >>> This does bring up the tangential question of whether GenomicRanges >>> should >>> have an anyNA method that returns FALSE (and similarly an is.na() >>> method), >>> although we have never defined the concept of a "missing range". >>> >>> Michael >>> >>> On Mon, Nov 2, 2015 at 4:55 PM, Gabe Becker <becker.g...@gene.com> wrote: >>> >>>> Pete, >>>> >>>> What does sorted mean for granges? If the starts are sorted but the >>>> ends >>>> aren't does that count? What if only the ends are but the ranges are on >>>> the >>>> negative strand? >>>> >>>> Do we consider seqlevels to be ordinal in the order the levels are >>>> returned >>>> from seqlevels ()? That usually makes sense, but does it always? >>>> >>>> In essence I'm asking if sortedness is a well enough defined term for an >>>> is.sorted method to make sense. >>>> >>>> Best, >>>> ~G >>>> On Nov 2, 2015 4:27 PM, "Peter Hickey" <peter.hic...@gmail.com> wrote: >>>> >>>>> Hi all, >>>>> >>>>> I sometimes want to test whether a GRanges object (or some object with >>>>> a GRanges slot, e.g., a SummarizedExperiment object) is (un)sorted. >>>>> There is no is.unsorted,GRanges-method or, rather, it defers to >>>>> is.unsorted,ANY-method. I'm unsure that deferring to the >>>>> is.unsorted,ANY-method is what is really desired when a user calls >>>>> is.unsorted on a GRanges object, and it will certainly return a >>>>> (possibly unrelated) warning - "In is.na(x) : is.na() applied to >>>>> non-(list or vector) of type 'S4'". >>>>> >>>>> >>>>> For this reason, I tend to use is.unsorted(order(x)) when x is a >>>>> GRanges object. This workaround is also used, for example, by minfi >>>>> >>>>> (https://github.com/kasperdanielhansen/minfi/blob/master/R/blocks.R#L121 >>>> >>>> ). >>>>> >>>>> However, this is slow because it essentially sorts the object to test >>>>> whether it is already sorted. >>>>> >>>>> >>>>> So, to my questions: >>>>> >>>>> 1. Have I overlooked a fast way to test whether a GRanges object is >>>> >>>> sorted? >>>>> >>>>> 2a. Could a is.unsorted,GenomicRanges-method be added to the >>>>> GenomicRanges package? Side note, I'm unsure at which level to define >>>>> this method, e.g., GRanges vs. GenomicRanges. >>>>> 2b. Is it possible to have a sensible definition and implementation >>>>> for is.unsorted,GRangesList-method? >>>>> 2c. Could a is.unsorted,RangedSummarizedExperiment-method be added to >>>>> the SummarizedExperiment package? >>>>> >>>>> I started working on a patch for 2a/2c, but wanted to ensure I hadn't >>>>> overlooked something obvious. Also, I'm sure 2a/2b/2c could be written >>>>> much more efficiently at the C-level but I'm afraid this might be a >>>>> bit beyond my abilities to integrate nicely with the existing code. >>>>> >>>>> Thanks, >>>>> Pete >>>>> >>>>> _______________________________________________ >>>>> Bioc-devel@r-project.org mailing list >>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel >>>>> >>>> >>>> [[alternative HTML version deleted]] >>>> >>>> _______________________________________________ >>>> Bioc-devel@r-project.org mailing list >>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel >>>> >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioc-devel@r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel >>> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpa...@fredhutch.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 > > _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel