Re: [R] slow computation of functions over large datasets

Paul Hiemstra Thu, 04 Aug 2011 07:11:41 -0700

 Hi all,

After reading this interesting discussion I delved a bit deeper into the
subject matter. The following snippet of code (see the end of my mail)
compares three ways of performing this task, using ddply, ave and one
yet unmentioned option: data.table (a package). The piece of code
generates mock datasets which vary in size and number of factor levels
for the factor. The results look like this (there is also a ggplot plot
in the script that summarise the table):


> res
   datsize noClasses   tave  tddply tdata.table
...note that I cut out part of the table for readability...
17   1e+07        10  9.160   3.500       1.064
18   1e+07        50 10.126   4.483       1.364
19   1e+07       100 10.485   5.016       1.407
20   1e+07       200 10.680   6.901       1.435
21   1e+07       500 10.801  12.569       1.474
22   1e+07      1000 10.923  21.001       1.540
23   1e+07      2500 11.514  51.020       1.622
24   1e+07     10000 12.158 182.752       1.737

It is clear that the option of using data.table is by far the fastest of
the three and scales quite nicely with the number of factor levels, in
contrast to ddply. It is also faster than ave by up to a factor of 10.

cheers,
Paul

library(ggplot2)
library(data.table)
theme_set(theme_bw())
datsize = c(10e4, 10e5, 10e6)
noClasses = c(10, 50, 100, 200, 500, 1000, 2500, 10e3)
comb = expand.grid(datsize = datsize, noClasses = noClasses)
res = ddply(comb, .(datsize, noClasses), function(x) {
  expdata = data.frame(value = runif(x$datsize),
              cat = round(runif(x$datsize, min = 0, max = x$noClasses)))
  expdataDT = data.table(expdata)
  t1 = system.time(res1 <- with(expdata, ave(value, cat, FUN = sum)))
  t2 = system.time(res2 <- ddply(expdata, .(cat), summarise, val =
sum(value)))
  t3 = system.time(res3 <- expdataDT[, sum(value), by = cat])
  return(data.frame(tave = t1[3], tddply = t2[3], tdata.table = t3[3]))
}, .progress = 'text')
res

ggplot(aes(x = noClasses, y = log(value), color = variable),
        data = melt(res, id.vars = c("datsize","noClasses"))) +
    facet_wrap(~ datsize) + geom_line()

> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: i686-pc-linux-gnu (32-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C             
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8   
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8  
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                
 [9] LC_ADDRESS=C               LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C      

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods 
[8] base    

other attached packages:
[1] data.table_1.6.3 ggplot2_0.8.9    proto_0.3-8      reshape_0.8.4  
[5] plyr_1.5.2       fortunes_1.4-1 

loaded via a namespace (and not attached):
[1] digest_0.4.2 tcltk_2.13.0 tools_2.13.0


On 08/03/2011 01:25 PM, Caroline Faisst wrote:
> Hello there,
>
>
> I'm computing the total value of an order from the price of the order items
> using a "for" loop and the "ifelse" function. I do this on a large dataframe
> (close to 1m lines). The computation of this function is painfully slow: in
> 1min only about 90 rows are calculated.
>
>
> The computation time taken for a given number of rows increases with the
> size of the dataset, see the example with my function below:
>
>
> # small dataset: function performs well
>
> exampledata<-data.frame(orderID=c(1,1,1,2,2,3,3,3,4),itemPrice=c(10,17,9,12,25,10,1,9,7))
>
> exampledata[1,"orderAmount"]<-exampledata[1,"itemPrice"]
>
> system.time(for (i in 2:length(exampledata[,1]))
> {exampledata[i,"orderAmount"]<-ifelse(exampledata[i,"orderID"]==exampledata[i-1,"orderID"],exampledata[i-1,"orderAmount"]+exampledata[i,"itemPrice"],exampledata[i,"itemPrice"])})
>
>
> # large dataset: the very same computational task takes much longer
>
> exampledata2<-data.frame(orderID=c(1,1,1,2,2,3,3,3,4,5:2000000),itemPrice=c(10,17,9,12,25,10,1,9,7,25:2000020))
>
> exampledata2[1,"orderAmount"]<-exampledata2[1,"itemPrice"]
>
> system.time(for (i in 2:9)
> {exampledata2[i,"orderAmount"]<-ifelse(exampledata2[i,"orderID"]==exampledata2[i-1,"orderID"],exampledata2[i-1,"orderAmount"]+exampledata2[i,"itemPrice"],exampledata2[i,"itemPrice"])})
>
>
>
> Does someone know a way to increase the speed?
>
>
> Thank you very much!
>
> Caroline
>
>       [[alternative HTML version deleted]]
>
>
>
> ______________________________________________
> [email protected] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.


-- 
Paul Hiemstra, Ph.D.
Global Climate Division
Royal Netherlands Meteorological Institute (KNMI)
Wilhelminalaan 10 | 3732 GK | De Bilt | Kamer B 3.39
P.O. Box 201 | 3730 AE | De Bilt
tel: +31 30 2206 494

http://intamap.geo.uu.nl/~paul
http://nl.linkedin.com/pub/paul-hiemstra/20/30b/770


        [[alternative HTML version deleted]]

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] slow computation of functions over large datasets

Reply via email to