And since as.integer(cut(x,bins)) is essentially findInterval(x,bins)
(since we throw away the labels made by cut()), I tried using
findInterval instead of cut() and it cut the time by more than half,
so your 5.0 s. is now c. 0.1 s.

   f3 <- function (m, bins) {
       nbins <- length(bins) - 1L
       m <- array(findInterval(m, bins), dim = dim(m))
       t(apply(m, 1, tabulate, nbins = nbins))
   }
   > system.time(r3 <- f3(m1,bins))
      user  system elapsed
      0.09    0.00    0.09
   > identical(r0,r3)
   [1] TRUE
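
A caveat on "essentially", as a small sketch rather than anything that was timed in this thread: the equivalence holds for continuous data like m1, where no value lands exactly on a break, but the default bin edges differ. hist() counts right-closed bins (a,b] with the lowest break included, while findInterval() by default uses left-closed bins [a,b). Assuming an R version where findInterval() has the left.open argument (3.3.0 or later, to my knowledge), the following toy example, with made-up x and bins, shows the difference and one way to match hist():

```r
# Toy data (hypothetical): every value sits exactly on a break.
x <- c(0, 1, 2, 3)
bins <- 0:3
nbins <- length(bins) - 1L

# Reference counts: bins [0,1], (1,2], (2,3]
h <- hist(x, breaks = bins, plot = FALSE)$counts

# Default findInterval(): bins [0,1), [1,2), [2,3); x == 3 gets index 4,
# which tabulate() silently ignores because it exceeds nbins.
a <- tabulate(findInterval(x, bins), nbins = nbins)

# left.open = TRUE makes bins right-closed; combined with it,
# rightmost.closed = TRUE closes the leftmost bin, like hist()'s defaults.
b <- tabulate(findInterval(x, bins, left.open = TRUE,
                           rightmost.closed = TRUE), nbins = nbins)

stopifnot(all(b == h), !all(a == h))
```

If the data can contain values equal to a break (rounded or integer measurements, say), passing those two arguments inside f3 should keep its counts in line with hist().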

Bill Dunlap
TIBCO Software
wdunlap tibco.com


On Fri, May 2, 2014 at 9:23 AM, William Dunlap <wdun...@tibco.com> wrote:
> Your original code, as a function of 'm' and 'bins' is
>    f0 <- function (m, bins) {
>        t(apply(m, 1, function(x) hist(x, breaks = bins, plot = FALSE)$counts))
>    }
> and the time it takes to run on your m1 is about 5 s. on my machine
>    > system.time(r0 <- f0(m1,bins))
>       user  system elapsed
>       4.95    0.00    5.02
>
> hist(x,breaks=bins) is essentially tabulate(cut(x,bins),nbins=length(bins)-1).
> See how much it speeds things up by replacing hist() with tabulate(cut()):
>    f1 <- function (m, bins)
>    {
>        nbins <- length(bins) - 1L
>        t(apply(m, 1, function(x) tabulate(cut(x, bins), nbins = nbins)))
>    }
> That doesn't help with the time, but it does give the same output
>    > system.time(r1 <- f1(m1,bins))
>       user  system elapsed
>       4.85    0.10    5.35
>    > identical(r0, r1)
>    [1] TRUE
>
> Now try speeding it up by calling cut() on the whole matrix first
> and then applying tabulate to each row, as in
>    f2 <- function (m, bins) {
>        nbins <- length(bins) - 1L
>        m <- array(as.integer(cut(m, bins)), dim = dim(m))
>        t(apply(m, 1, tabulate, nbins = nbins))
>    }
> That saves quite a bit of time and gives the same output
>    > system.time(r2 <- f2(m1,bins))
>       user  system elapsed
>       0.25    0.00    0.25
>    > identical(r0, r2)
>    [1] TRUE
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
>
> On Thu, May 1, 2014 at 12:48 PM, Ortiz-Bobea, Ariel <ortiz-bo...@rff.org>
> wrote:
>> Hello everyone,
>>
>> I'm trying to construct bins for each row in a matrix. I'm using apply() in
>> combination with hist() to do this. Performing this binning for a 10K-by-50
>> matrix takes about 5 seconds, but only 0.5 seconds for a 1K-by-500 matrix.
>> This suggests the bottleneck is accessing rows in apply() rather than the
>> calculations going on inside hist().
>>
>> My initial idea is to process as many columns (as make sense for the
>> intended use) at once.
>> However, I still have many many rows to process and I
>> would appreciate any feedback on how to speed this up.
>>
>> Any thoughts?
>>
>> Thanks,
>>
>> Ariel
>>
>> Here is the illustration:
>>
>> # create data
>> m1 <- matrix(10*rnorm(50*10^4), ncol=50)
>> m2 <- matrix(10*rnorm(50*10^4), ncol=500)
>>
>> # compute bins
>> bins <- seq(-100,100,1)
>> system.time({ out1 <- t(apply(m1,1, function(x) hist(x,breaks=bins,
>> plot=FALSE)$counts)) })
>> system.time({ out2 <- t(apply(m2,1, function(x) hist(x,breaks=bins,
>> plot=FALSE)$counts)) })
>>
>> ---
>> Ariel Ortiz-Bobea
>> Fellow
>> Resources for the Future
>> 1616 P Street, N.W.
>> Washington, DC 20036
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
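
Since the original question suspects the per-row overhead of apply(), here is a sketch of one further variation that was not proposed or timed in this thread: drop apply() altogether and count (row, bin) pairs with a single tabulate() call. The name f4 is hypothetical. It assumes, as in the example data, that no value falls exactly on a break, so the default findInterval() bins agree with hist(). No timing is claimed here; system.time() on your own data would show whether it helps.

```r
# Hypothetical f4: per-row bin counts with one tabulate() call, no apply().
f4 <- function(m, bins) {
    nbins <- length(bins) - 1L
    nr <- nrow(m)
    idx <- findInterval(m, bins)        # bin index of every matrix element
    ok <- idx >= 1L & idx <= nbins      # ignore values outside the breaks
    # Column-major position of cell (row, bin) in an nr-by-nbins matrix
    cell <- (idx[ok] - 1L) * nr + row(m)[ok]
    matrix(tabulate(cell, nbins = nr * nbins), nrow = nr, ncol = nbins)
}

# Compare against the row-wise approach on data shaped like the example.
set.seed(1)
m1 <- matrix(10 * rnorm(50 * 10^4), ncol = 50)
bins <- seq(-100, 100, 1)
r3 <- t(apply(array(findInterval(m1, bins), dim = dim(m1)), 1,
              tabulate, nbins = length(bins) - 1L))
r4 <- f4(m1, bins)
stopifnot(all(dim(r3) == dim(r4)), all(r3 == r4))
```

The design choice is to encode each (row, bin) pair as a single integer, so the counting happens once in compiled code instead of once per row. The temporary vectors are as large as the input matrix, which is the price paid for avoiding the loop.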