Re: [R] Need very fast application of 'diff' - ideas?

Martin Morgan Sun, 29 Jan 2012 09:25:17 -0800

On 01/29/2012 08:12 AM, Kevin Ummel wrote:

Sorry, guys. I'm not active on the listserve, so my last post was held by the 
moderator until after Dirk's solution was posted.


Excellent stuff.

Is 'diff' really the bottleneck in your calculations? I would have saiddiff was in the class of 'fast' R calculations, so would expect manyother steps in a real analysis, including a poorly constructed input, tobe much more expensive.

Since speed is apparently of the essence, it makes sense to create ashared library and load that, rather than re-compiling it (via inline)each time.

The calculation is very straight-forward in C. It makes sense to use the'.Call' interface to avoid copying on the way in and out, and other Roverhead of the '.C' interface. A simple solution, assuming the correctargument type (numeric; the original post talks about integer values butthe values actually used (floor(x)) are numeric and presumably in aspeed-is-of-the-essence application the data would be created as the thetype of interest), no NAs, non-0 length input, etc., is (like Hans'solution, using the .C interface), in file cdiff.c:


#include <Rdefines.h>

SEXP cdiff(SEXP x)
{
    const int len = Rf_length(x) - 1;
    SEXP ans = Rf_allocVector(REALSXP, len);
    double *ansp = REAL(ans), *xp = REAL(x);
    for (int i = 0; i < len; ++i)
        ansp[i] = xp[i + 1] - xp[i];
    return ans;
}

I doubt that, with appropriate optimization flags for the compiler, useof [] vs. pointer arithmetic would make a difference. With compilation as


R CMD SHLIB cdiff.c

One would probably want to compile this with a high optimization, e.g.,from the 'Writing R Extensions' manual section 5.5


MAKEFLAGS="CFLAGS=-O3" R CMD SHLIB cdiff.c

Use as

    dyn.load("cdiff.so")
    ## ...
    x = rnorm(10000000)
    ans <- .Call("cdiff", x)

For me I get

> dyn.load("cdiff.so")
> system.time(x <- rnorm(10000000))
   user  system elapsed
  1.577   0.015   1.594
> system.time(ans0 <- diff(x))
   user  system elapsed
  0.842   0.110   0.952
> system.time(ans1 <- .Call("cdiff", x))
   user  system elapsed
  0.024   0.020   0.044
> identical(ans0, ans1)
[1] TRUE

Note that just creating random data already dominates the calculation,which is why diff seems such an unlikely candidate for a bottleneck. Iwould guess that obtaining memory for the answer is the big bottleneckin cdiff (or Rcpp); one could work around this but then violate R'smemory model. That I can write C code that is 20x faster than 'diff'comes at a significant price, in terms of error checking and, yes,development time.

The timing for python in the original post doesn't capture the full costof the calculation; it should include the cost of the subset and viewconstruction (or whatever the efficient pythonic paradigm is)


Martin


thanks,
kevin

On Jan 29, 2012, at 8:37 AM, R. Michael Weylandt wrote:

Have you not followed your own thread? Dirk is Mr. Rcpp himself and he
gives an implementation that gives you 25x improvement here as well as
tips for getting even more out of it:

http://tolstoy.newcastle.edu.au/R/e17/help/12/01/2471.html

Michael

On Sat, Jan 28, 2012 at 12:28 PM, Kevin Ummel<kevinum...@gmail.com>  wrote:

Thanks. I've played around with pure R solutions. The fastest re-write of diff 
(for the 1 lag case) I can seem to find is this:

diff2 = function(x) {
  y = c(x,NA) - c(NA,x)
  y[2:length(x)]
}

#Compiling via 'cmpfun' doesn't seem to help (or hurt):
require(compiler)
diff2 = cmpfun(diff2)

But that only gets ~10% improvement over default 'diff' on my machine. Still 
too slow for my particular application.

I'm inclined towards Michael's suggestion of inline+Rcpp (or some other use of 
C under the hood).

Could someone show me how to go about doing that?

Thanks!
Kevin

On Jan 28, 2012, at 9:14 AM, Peter Langfelder wrote:

ehm... this doesn't take very many ideas.


x = runif(n=10e6, min=0, max=1000)
x = round(x)

system.time( {
  y = x[-1] - x[-length(x)]
})

I get about 0.5 seconds on my old laptop.

HTH

Peter


On Fri, Jan 27, 2012 at 4:15 PM, Kevin Ummel<kevinum...@gmail.com>  wrote:

Hi everyone,

Speed is the key here.

I need to find the difference between a vector and its one-period lag (i.e. the 
difference between each value and the subsequent one in the vector). Let's say 
the vector contains 10 million random integers between 0 and 1,000. The 
solution vector will have 9,999,999 values, since their is no lag for the 1st 
observation.

In R we have:

#Set up input vector
x = runif(n=10e6, min=0, max=1000)
x = round(x)

#Find one-period difference
y = diff(x)

Question is: How can I get the 'diff(x)' part as fast as absolutely possible? I 
queried some colleagues who work with other languages, and they provided 
equivalent solutions in Python and Clojure that, on their machines, appear to 
be potentially much faster (I've put the code below in case anyone is 
interested). However, they mentioned that the overhead in passing the data 
between languages could kill any improvements. I don't have much experience 
integrating other languages, so I'm hoping the community has some ideas about 
how to approach this particular problem...

Many thanks,
Kevin

In iPython:

In [3]: import numpy as np
In [4]: arr = np.random.randint(0, 1000, (10000000,1)).astype("int16")
In [5]: arr1 = arr[1:].view()
In [6]: timeit arr2 = arr1 - arr[:-1]
10 loops, best of 3: 20.1 ms per loop

In Clojure:

(defn subtract-lag
  [n]
  (let [v (take n (repeatedly rand))]
    (time (dorun (map - v (cons 0 v))))))





        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Need very fast application of 'diff' - ideas?

Reply via email to