Many thanks. I also tried the filter() function in dplyr, and it was also much slower than simply indexing the way the original code does.
system.time(replicate(500, filter(tmp, id == idList[1])))

I did this on the toy example as well as the real data, finding the same (slower) result each time compared to the indexing method. Perhaps I'm using it incorrectly?

-----Original Message-----
From: Constantin Weiser [mailto:constantin.wei...@hhu.de]
Sent: Wednesday, September 28, 2016 12:55 PM
To: r-help@r-project.org
Cc: Doran, Harold <hdo...@air.org>
Subject: Re: [R] Faster Subsetting

I just modified the reproducible example a bit, so it's a bit more realistic. The function "mean" could be "easily" replaced by your analysis.

And here are some possible solutions:

tmp <- data.frame(id = rep(1:2000, each = 100), foo = rnorm(200000))
tmp <- tmp[sample(dim(tmp)[1]),] # re-sampling the dataset

## with specialized packages
require(plyr)
system.time({
  res1 <- ddply(tmp, .(id), summarize, mean=mean(foo))
})

require(dplyr)
system.time({
  res2 <- tmp %>% group_by(id) %>% summarise(mean = mean(foo))
})

library(data.table)
system.time({
  res3 <- data.table(tmp)[, list(mean=mean(foo)), by=id]
})

## built-in R methods
system.time({
  res4 <- aggregate(tmp$foo, by = list(id=tmp$id), FUN = mean)
})

system.time({
  res5 <- sapply(unique(tmp$id), simplify = TRUE, FUN = function(x){
    c(id=x, mean=mean(tmp[which(tmp$id == x), "foo"]))
  })
})
res5 <- t(res5)

system.time({
  res5 <- sapply(unique(tmp$id), simplify = TRUE, FUN = function(x){
    sub.tmp <- subset(tmp, tmp$id == x)
    c(x,mean=mean(sub.tmp[, "foo"]))
  })
})
res5 <- t(res5)

Yours
Constantin

--
^
| X
| /eiser, Dr. Constantin (weis...@hhu.de)
| /Chair of Statistics and Econometrics
| / Heinrich Heine-University of Düsseldorf
| * /\ / Universitätsstraße 1, 40225 Düsseldorf, Germany
| \ / \ / Oeconomicum (Building 24.31), Room 01.22
| \/ \/ Tel: 0049 211 81-15307
+----------------------------------------------------------->

On 28.09.2016 at 18:28, Doran, Harold wrote:
> Thank you very much. I don't know tidyverse, I'll look at that now.
> I did some tests with the data.table package, but it was much slower on my
> machine; see the examples below.
>
> tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
>
> idList <- unique(tmp$id)
>
> system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))
>
> system.time(replicate(500, subset(tmp, id == idList[1])))
>
>
> library(data.table)
>
> tmp2 <- as.data.table(tmp) # data.table
>
> system.time(replicate(500, tmp2[which(tmp$id == idList[1]),]))
>
> system.time(replicate(500, subset(tmp2, id == idList[1])))
>
> From: Dominik Schneider [mailto:dosc3...@colorado.edu]
> Sent: Wednesday, September 28, 2016 12:27 PM
> To: Doran, Harold <hdo...@air.org>
> Cc: r-help@r-project.org
> Subject: Re: [R] Faster Subsetting
>
> I regularly crunch through this amount of data with tidyverse. You can also
> try the data.table package. They are optimized for speed, as long as you have
> the memory.
> Dominik
>
> On Wed, Sep 28, 2016 at 10:09 AM, Doran, Harold <hdo...@air.org> wrote:
> I have an extremely large data frame (~13 million rows) that resembles the
> structure of the object tmp below in the reproducible code. In my real data,
> the variable 'id' may or may not be ordered, but I think that is irrelevant.
>
> I have a process that requires subsetting the data by id and then running
> each smaller data frame through a set of functions. One example below uses
> indexing and the other uses an explicit call to subset(); both return the
> same result, but indexing is faster.
>
> The problem is that in my real data, indexing must parse through millions of
> rows to evaluate the condition, and this is expensive and a bottleneck in my
> code. I'm curious if anyone can recommend an improvement that would somehow
> be less expensive and faster?
>
> Thank you
> Harold
>
>
> tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
>
> idList <- unique(tmp$id)
>
> ### Fast, but not fast enough
> system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))
>
> ### Not fast at all, a big bottleneck
> system.time(replicate(500, subset(tmp, id == idList[1])))
>
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
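
For the bottleneck described in the original post (re-evaluating tmp$id == x over all rows for every id), one base-R option that was not benchmarked in this thread is to compute the row indices for every id once with split() and then reuse them. The following is only a minimal sketch on the toy data; the names idx and analyse are hypothetical, and analyse() merely stands in for the real per-id analysis. Whether it helps on the ~13 million row data would need to be timed there.

```r
# Toy data shaped like the example in the thread, with ids in random order
set.seed(1)
tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
tmp <- tmp[sample(nrow(tmp)), ]

# Group the row positions by id in a single pass over the id column,
# instead of scanning the whole column once per id
idx <- split(seq_len(nrow(tmp)), tmp$id)

# Hypothetical placeholder for the per-id analysis
analyse <- function(d) mean(d$foo)

# Each subset is now a plain integer-index lookup
res <- vapply(idx, function(i) analyse(tmp[i, , drop = FALSE]), numeric(1))
```

Here idx is a list named by id, so tmp[idx[["1"]], ] returns the same rows as tmp[which(tmp$id == 1), ], but without a new comparison over the full column.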