Re: [Bioc-devel] is.unsorted method for GRanges objects

Hervé Pagès Mon, 02 Nov 2015 22:47:48 -0800

OK. Thanks Pete for the timings. The fact that the relative difference
in speed is larger for small n in your brief tests is because one
performs roughly in n*log(n) (quicksort-based) and the other one is
linear in time. Which is why I assumed (but without doing any testing)
that the latter was going to perform better. Anyway it seems that there
is just too much overhead involved in that solution to make it a good
candidate.


So back to square one and to the business of trying to come up with
something even more efficient than is.unsorted(order(x)) for
GenomicRanges objects. It's indeed important that is.unsorted() be
as fast and as memory efficient as possible since it is typically
used as a quick/cheap way of checking whether a costly sort is
required or not (e.g. with something like if (is.unsorted(x))
x <- sort(x)).

So it seems that unfortunately we won't be able to do it without
writing some C code. Your proposal sounds very reasonable to me. It
will perform in linear time (in the worst case) and avoid any copy
of the object (that we get with the expensive calls to head() and
tail() in my solution). So will be much faster than the 2 R solutions
whatever n is. Should work on GenomicRanges objects, not just GRanges
(this is easily achieved by passing S4Vectors:::decodeRle(seqnames(x)),
start(x), with(x), and S4Vectors:::decodeRle(strand(x)) to the .Call
entry point).

I'll take your patch if you want to work on this or I can add this
to GenomicRanges, let me know. We should probably take this off-list.

Thanks,
H.

On 11/02/2015 09:43 PM, Peter Hickey wrote:

Thanks for everyones' input.

@Hervé FWIW, the below benchmark suggests that unfortunately this is a
fair bit slower than is.unsorted(order(gr)) when the length of the
GRanges object is < 10,000,000 (the relative difference in speed is
larger for small n in my brief tests; I didn't check above n >
10,000,000)

```r
# GenomicRanges_1.23.1
library(GenomicRanges)

# Simulate some random ranges
sim_gr <- function(n) {
   GRanges(seqnames = sample(paste0("chr", 1:22), n, replace = TRUE),
           ranges = IRanges(sample(n * 10, size = n, replace = TRUE),
                   width = runif(n, 1, 100000)),
           strand = sample(c("+", "-", "*"), n, replace = TRUE),
           seqinfo = Seqinfo(paste0("chr", 1:22)))

}

gr <- sim_gr(10000000)

herve <- function(x, na.rm=FALSE, strictly=FALSE) {
   if (length(x) <= 1L)
     return(FALSE)
   x1 <- head(x, n=-1)
   x2 <- tail(x, n=-1)
   if (strictly)
     return(any(x1 >= x2))
   any(x1 > x2)
}

# 22 seconds
system.time(herve(gr))
# 11.3 seconds
system.time(is.unsorted(order(gr)))

# And when it's already sorted
gr2 <- sort(gr)

# 4.3 seconds
system.time(herve(gr2))
# 0.2 seconds
system.time(is.unsorted(order(gr2)))
```

Roughly, it looks like the head(), tail() calls take approximately 1/4
of the time each, while the any() call takes the remaining 1/2 of the
time. I was thinking it might be possible to make this quite fast by
looping over the GRanges object at the C-level and breaking out of the
loop if gr[i+1] <= gr[i] or gr[i+1] < gr[i], as appropriate. Does this
sound reasonable?

Cheers,
Pete

On 3 November 2015 at 14:06, Michael Lawrence <lawrence.mich...@gene.com> wrote:



On Mon, Nov 2, 2015 at 6:39 PM, Hervé Pagès <hpa...@fredhutch.org> wrote:


Hi,

@Pete:

2a- I would just compare pairs of adjacent elements, taking
advantage of the fact that <= is vectorized and cheap. So something
like:

   setMethod("is.unsorted", "Vector",
     function(x, na.rm=FALSE, strictly=FALSE)
     {
         if (length(x) <= 1L)
             return(FALSE)
         x1 <- head(x, n=-1)
         x2 <- tail(x, n=-1)
         if (strictly)
             return(any(x1 >= x2))
         any(x1 > x2)
     }
   )

Since this will work on any Vector derivative for which <= and
subsetting are defined, it's a good candidate for being the default
"is.unsorted" method for Vector objects. I'll add it to S4Vectors.

2b- The semantic of is.unsorted() on a GRangesList object or any
List object in general should be sapply(x, is.unsorted), for
consistency with order(), sort(), etc:

   > sort(IntegerList(4:3, 1:-2))
   IntegerList of length 2
   [[1]] 3 4
   [[2]] -2 -1 0 1

I'll add this too.

2c - That won't be needed. The default method for Vector objects will
work on RangedSummarizedExperiment objects (<= and 1D subsetting are
defined and along the same dimension).

@Gabe:

See ?`GenomicRanges-comparison` for how the order of genomic ranges
is defined.

@Michael:

Calling base::.gt() in a loop sounds indeed very inefficient. What
about having base::is.unsorted() do the above on "objects" instead?
base::.gt() seems to also require that the object is subsettable so
the requirements are the same.

Then we wouldn't need the default "is.unsorted" method for Vector
objects, only a default "anyNA" method for Vector objects that always
returns FALSE (plus some specific ones for Rle and other Vector
derivatives that support NAs).


Yea, I assumed it did what you suggested before looking at it. It would be
the right place to fix this.


Thanks,
H.


On 11/02/2015 05:35 PM, Michael Lawrence wrote:


The notion of sortedness is already formally defined, which is why we
have
an order method, etc.

The base is.unsorted implementation for "objects" ends up calling
base::.gt() for each adjacent pair of elements, which is likely too slow
to
be practical, so we probably should add a custom method.

This does bring up the tangential question of whether GenomicRanges
should
have an anyNA method that returns FALSE (and similarly an is.na()
method),
although we have never defined the concept of a "missing range".

Michael

On Mon, Nov 2, 2015 at 4:55 PM, Gabe Becker <becker.g...@gene.com> wrote:

Pete,

What does sorted mean for granges? If the starts  are sorted but the
ends
aren't does that count? What if only the ends are but the ranges are on
the
negative strand?

Do we consider seqlevels to be ordinal in the order the levels are
returned
from seqlevels ()? That usually makes sense, but does it always?

In essence I'm asking if sortedness is a well enough defined term for an
is.sorted method to make sense.

Best,
~G
On Nov 2, 2015 4:27 PM, "Peter Hickey" <peter.hic...@gmail.com> wrote:

Hi all,

I sometimes want to test whether a GRanges object (or some object with
a GRanges slot, e.g., a SummarizedExperiment object) is (un)sorted.
There is no is.unsorted,GRanges-method or, rather, it defers to
is.unsorted,ANY-method. I'm unsure that deferring to the
is.unsorted,ANY-method is what is really desired when a user calls
is.unsorted on a GRanges object, and it will certainly return a
(possibly unrelated) warning - "In is.na(x) : is.na() applied to
non-(list or vector) of type 'S4'".


For this reason, I tend to use is.unsorted(order(x)) when x is a
GRanges object. This workaround is also used, for example, by minfi

(https://github.com/kasperdanielhansen/minfi/blob/master/R/blocks.R#L121

).


However, this is slow because it essentially sorts the object to test
whether it is already sorted.


So, to my questions:

1. Have I overlooked a fast way to test whether a GRanges object is


sorted?


2a. Could a is.unsorted,GenomicRanges-method be added to the
GenomicRanges package? Side note, I'm unsure at which level to define
this method, e.g., GRanges vs. GenomicRanges.
2b. Is it possible to have a sensible definition and implementation
for is.unsorted,GRangesList-method?
2c. Could a is.unsorted,RangedSummarizedExperiment-method be added to
the SummarizedExperiment package?

I started working on a patch for 2a/2c, but wanted to ensure I hadn't
overlooked something obvious. Also, I'm sure 2a/2b/2c could be written
much more efficiently at the C-level but I'm afraid this might be a
bit beyond my abilities to integrate nicely with the existing code.

Thanks,
Pete

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


          [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


         [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] is.unsorted method for GRanges objects

Reply via email to