Hi all,
After reading this interesting discussion I delved a bit deeper into the
subject. The code at the end of this mail compares three ways of
performing this task: ddply, ave, and one option not yet mentioned,
data.table (a package). It generates mock datasets that vary in size and
in the number of levels of the grouping factor. The results (elapsed
time in seconds) look like this (the script also draws a ggplot plot
that summarises the table):
> res
   datsize noClasses   tave  tddply tdata.table
...note that I cut out part of the table for readability...
17   1e+07        10  9.160   3.500       1.064
18   1e+07        50 10.126   4.483       1.364
19   1e+07       100 10.485   5.016       1.407
20   1e+07       200 10.680   6.901       1.435
21   1e+07       500 10.801  12.569       1.474
22   1e+07      1000 10.923  21.001       1.540
23   1e+07      2500 11.514  51.020       1.622
24   1e+07     10000 12.158 182.752       1.737
It is clear that data.table is by far the fastest of the three and
scales nicely with the number of factor levels: at 1e+07 rows it goes
from about 1.1 s at 10 levels to 1.7 s at 10000 levels, whereas ddply
goes from 3.5 s to almost 183 s. It is also faster than ave by up to a
factor of 10.
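To make the link with Caroline's question concrete, here is a minimal sketch on her small example. It assumes that the rows of each order sit next to each other, as in her data. Note that her loop builds a running total within each order (cumsum), while the benchmark times the plain per-group sum. The column name orderTotal is mine, and the data.table part assumes a reasonably recent data.table that supports := together with by; it is skipped if the package is not installed.

```r
# Caroline's small example from the original post
exampledata <- data.frame(orderID   = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
                          itemPrice = c(10, 17, 9, 12, 25, 10, 1, 9, 7))

# Running total within each order: what the for/ifelse loop computes
exampledata$orderAmount <- with(exampledata,
                                ave(itemPrice, orderID, FUN = cumsum))

# Order total repeated on every row: what the benchmark times
# (orderTotal is a hypothetical column name, not from the original post)
exampledata$orderTotal <- with(exampledata,
                               ave(itemPrice, orderID, FUN = sum))

print(exampledata)

# Optional data.table equivalent of the running total
# (assumes a data.table version that supports := with by)
if (requireNamespace("data.table", quietly = TRUE)) {
    library(data.table)
    dt <- data.table(exampledata[, c("orderID", "itemPrice")])
    dt[, orderAmount := cumsum(itemPrice), by = orderID]
    print(dt)
}
```

The last row of each order in orderAmount equals orderTotal, so either column gives the total value of the order.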
cheers,
Paul
library(plyr)        # ddply(), summarise(); not attached by newer ggplot2 versions
library(reshape)     # melt(); not attached by newer ggplot2 versions
library(ggplot2)
library(data.table)
theme_set(theme_bw())

datsize = c(10e4, 10e5, 10e6)
noClasses = c(10, 50, 100, 200, 500, 1000, 2500, 10e3)
comb = expand.grid(datsize = datsize, noClasses = noClasses)

# For each combination, build a mock dataset and time the three approaches
res = ddply(comb, .(datsize, noClasses), function(x) {
    expdata = data.frame(value = runif(x$datsize),
                         cat = round(runif(x$datsize, min = 0, max = x$noClasses)))
    expdataDT = data.table(expdata)

    t1 = system.time(res1 <- with(expdata, ave(value, cat, FUN = sum)))
    t2 = system.time(res2 <- ddply(expdata, .(cat), summarise, val = sum(value)))
    t3 = system.time(res3 <- expdataDT[, sum(value), by = cat])
    # element 3 of system.time() is the elapsed time in seconds
    return(data.frame(tave = t1[3], tddply = t2[3], tdata.table = t3[3]))
}, .progress = 'text')

res

# One panel per dataset size, log(time) against the number of classes
ggplot(aes(x = noClasses, y = log(value), color = variable),
       data = melt(res, id.vars = c("datsize", "noClasses"))) +
    facet_wrap(~ datsize) + geom_line()
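A small side note on the mock data, as a sketch rather than a correction: round(runif(n, min = 0, max = k)) produces k + 1 distinct categories (0 up to k), with the two end values occurring about half as often as the others. This should not change the conclusions, but if exactly k equally likely classes are wanted, sample.int is one possible alternative. The names k, n, cat_round and cat_sample, and the seed, are only for illustration.

```r
set.seed(1)    # arbitrary seed, only for reproducibility
k <- 10        # plays the role of x$noClasses in the script
n <- 1e5       # plays the role of x$datsize

cat_round  <- round(runif(n, min = 0, max = k))   # as in the script
cat_sample <- sample.int(k, n, replace = TRUE)    # hypothetical alternative

length(unique(cat_round))    # k + 1 distinct values: 0, 1, ..., k
length(unique(cat_sample))   # k distinct values: 1, ..., k
table(cat_round)             # 0 and k occur about half as often as the rest
```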
> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: i686-pc-linux-gnu (32-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] grid stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] data.table_1.6.3 ggplot2_0.8.9 proto_0.3-8 reshape_0.8.4
[5] plyr_1.5.2 fortunes_1.4-1
loaded via a namespace (and not attached):
[1] digest_0.4.2 tcltk_2.13.0 tools_2.13.0
On 08/03/2011 01:25 PM, Caroline Faisst wrote:
> Hello there,
>
>
> I'm computing the total value of an order from the price of the order items
> using a "for" loop and the "ifelse" function. I do this on a large dataframe
> (close to 1m lines). The computation of this function is painfully slow: in
> 1min only about 90 rows are calculated.
>
>
> The computation time taken for a given number of rows increases with the
> size of the dataset, see the example with my function below:
>
>
> # small dataset: function performs well
>
> exampledata<-data.frame(orderID=c(1,1,1,2,2,3,3,3,4),itemPrice=c(10,17,9,12,25,10,1,9,7))
>
> exampledata[1,"orderAmount"]<-exampledata[1,"itemPrice"]
>
> system.time(for (i in 2:length(exampledata[,1]))
> {exampledata[i,"orderAmount"]<-ifelse(exampledata[i,"orderID"]==exampledata[i-1,"orderID"],exampledata[i-1,"orderAmount"]+exampledata[i,"itemPrice"],exampledata[i,"itemPrice"])})
>
>
> # large dataset: the very same computational task takes much longer
>
> exampledata2<-data.frame(orderID=c(1,1,1,2,2,3,3,3,4,5:2000000),itemPrice=c(10,17,9,12,25,10,1,9,7,25:2000020))
>
> exampledata2[1,"orderAmount"]<-exampledata2[1,"itemPrice"]
>
> system.time(for (i in 2:9)
> {exampledata2[i,"orderAmount"]<-ifelse(exampledata2[i,"orderID"]==exampledata2[i-1,"orderID"],exampledata2[i-1,"orderAmount"]+exampledata2[i,"itemPrice"],exampledata2[i,"itemPrice"])})
>
>
>
> Does someone know a way to increase the speed?
>
>
> Thank you very much!
>
> Caroline
>
> [[alternative HTML version deleted]]
>
>
>
> ______________________________________________
> [email protected] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
--
Paul Hiemstra, Ph.D.
Global Climate Division
Royal Netherlands Meteorological Institute (KNMI)
Wilhelminalaan 10 | 3732 GK | De Bilt | Kamer B 3.39
P.O. Box 201 | 3730 AE | De Bilt
tel: +31 30 2206 494
http://intamap.geo.uu.nl/~paul
http://nl.linkedin.com/pub/paul-hiemstra/20/30b/770