Re: [R] Faster Subsetting

Doran, Harold Wed, 28 Sep 2016 10:09:44 -0700

Many thanks. I also tried the filter function in dplyr, and it was also much 
slower than simply indexing the way the original code did.


system.time(replicate(500, filter(tmp, id == idList[1])))

I did this on the toy example as well as the real data, finding the same 
(slower) result each time compared to the indexing method.

Perhaps I'm using it incorrectly?
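
One possible reading, offered as a sketch rather than a diagnosis: every benchmark in this thread rescans the whole id column on each call, so the per-call cost grows with the number of rows no matter which function does the filtering. If each id is needed eventually, a base-R alternative is to partition the data once with split() and then fetch groups by name. The names `pieces`, `first` and `same` are made up for illustration; `tmp` and `idList` follow the toy example quoted below.

```r
# Toy data with the same shape as the example in this thread
set.seed(1)
tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
idList <- unique(tmp$id)

# Partition once: a named list holding one data frame per id
pieces <- split(tmp, tmp$id)

# Fetching a group is now a list lookup by name, not a scan of tmp$id
first <- pieces[[as.character(idList[1])]]

# Check that this matches the rows returned by the indexing approach
indexed <- tmp[which(tmp$id == idList[1]), ]
same <- isTRUE(all.equal(first, indexed))
```

The split itself is not free, so this is only likely to pay off when many or all ids are processed, as in the per-id workflow described in the original question.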



-----Original Message-----
From: Constantin Weiser [mailto:constantin.wei...@hhu.de] 
Sent: Wednesday, September 28, 2016 12:55 PM
To: r-help@r-project.org
Cc: Doran, Harold <hdo...@air.org>
Subject: Re: [R] Faster Subsetting

I just modified the reproducible example a bit, so it's a bit more realistic. 
The function "mean" could be "easily" replaced by your analysis.

And here are some possible solutions:

tmp <- data.frame(id = rep(1:2000, each = 100), foo = rnorm(200000))
tmp <- tmp[sample(dim(tmp)[1]),] # re-sampling the dataset

## with specialized packages
require(plyr)
system.time({
   res1 <- ddply(tmp, .(id), summarize, mean=mean(foo))
})

require(dplyr)
system.time({
   res2 <- tmp %>%
     group_by(id) %>%
     summarise(mean = mean(foo))
})

library(data.table)
system.time({
   res3 <- data.table(tmp)[, list(mean=mean(foo)), by=id]
})
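
For the original problem of pulling out one id at a time, data.table also supports keyed lookups. The following is a sketch, assuming the data.table package is installed; `dt` and `one_id` are illustrative names, and the id value 5 is arbitrary. setkey() sorts the table by id once, after which a lookup on the key can use a binary search instead of comparing every row.

```r
# Only run when data.table is available
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)

  set.seed(1)
  tmp <- data.frame(id = rep(1:2000, each = 100), foo = rnorm(200000))

  dt <- as.data.table(tmp)
  setkey(dt, id)        # sort by id once and mark id as the key

  # Keyed lookup: all rows whose id equals 5
  one_id <- dt[.(5L)]
}
```

Whether this beats plain indexing depends on the data size; on a toy data set the overhead of the call itself may dominate.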


## built-in R methods
system.time({
   res4 <- aggregate(tmp$foo, by = list(id=tmp$id), FUN = mean)
})

system.time({
   res5 <- sapply(unique(tmp$id), simplify = TRUE,
                  FUN = function(x){
                    c(id=x, mean=mean(tmp[which(tmp$id == x), "foo"]))
                  })
})
res5 <- t(res5)

## same idea using subset(); stored as res6 so that res5 is not overwritten
system.time({
   res6 <- sapply(unique(tmp$id), simplify = TRUE,
                  FUN = function(x){
                    sub.tmp <- subset(tmp, id == x)
                    c(id=x, mean=mean(sub.tmp[, "foo"]))
                  })
})
res6 <- t(res6)
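
The two sapply() variants above scan tmp$id once for every id, which is the same pattern that is slow in the original question. As a base-R sketch that groups the data in a single call, tapply() can be used instead; `means`, `res_tapply`, `res_agg` and `agrees` are illustrative names, and the data are regenerated here so the block runs on its own.

```r
set.seed(1)
tmp <- data.frame(id = rep(1:2000, each = 100), foo = rnorm(200000))
tmp <- tmp[sample(nrow(tmp)), ]   # shuffle the rows, as in the example above

# Group foo by id and take the mean of each group
means <- tapply(tmp$foo, tmp$id, mean)
res_tapply <- data.frame(id = as.integer(names(means)),
                         mean = as.vector(means))

# Cross-check against aggregate(), which returns the group means in column x
res_agg <- aggregate(tmp$foo, by = list(id = tmp$id), FUN = mean)
agrees <- isTRUE(all.equal(res_tapply$mean, res_agg$x))
```

Replacing mean with a real analysis function would need lapply(split(tmp, tmp$id), ...) rather than tapply(), since tapply() passes only the one vector.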


Yours
Constantin


--
^
|                X
|               /eiser, Dr. Constantin (weis...@hhu.de)
|              /Chair of Statistics and Econometrics
|             / Heinrich Heine-University of Düsseldorf
| *    /\    /  Universitätsstraße 1, 40225 Düsseldorf, Germany
|  \  /  \  /   Oeconomicum (Building 24.31), Room 01.22
|   \/    \/    Tel: 0049 211 81-15307
+----------------------------------------------------------->

On 28.09.2016 at 18:28, Doran, Harold wrote:
> Thank you very much. I don’t know tidyverse, I’ll look at that now. I 
> did some tests with data.table package, but it was much slower on my 
> machine, see examples below
>
> tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
>
> idList <- unique(tmp$id)
>
> system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))
>
> system.time(replicate(500, subset(tmp, id == idList[1])))
>
>
> library(data.table)
>
> tmp2 <- as.data.table(tmp)     # data.table
>
> system.time(replicate(500, tmp2[which(tmp$id == idList[1]),]))
>
> system.time(replicate(500, subset(tmp2, id == idList[1])))
>
> From: Dominik Schneider [mailto:dosc3...@colorado.edu]
> Sent: Wednesday, September 28, 2016 12:27 PM
> To: Doran, Harold <hdo...@air.org>
> Cc: r-help@r-project.org
> Subject: Re: [R] Faster Subsetting
>
> I regularly crunch through this amount of data with tidyverse. You can also 
> try the data.table package. They are optimized for speed, as long as you have 
> the memory.
> Dominik
>
> On Wed, Sep 28, 2016 at 10:09 AM, Doran, Harold 
> <hdo...@air.org<mailto:hdo...@air.org>> wrote:
> I have an extremely large data frame (~13 million rows) that resembles the 
> structure of the object tmp below in the reproducible code. In my real data, 
> the variable, 'id' may or may not be ordered, but I think that is irrelevant.
>
> I have a process that requires subsetting the data by id and then running 
> each smaller data frame through a set of functions. One example below uses 
> indexing and the other uses an explicit call to subset(), both return the 
> same result, but indexing is faster.
>
> Problem is in my real data, indexing must parse through millions of rows to 
> evaluate the condition and this is expensive and a bottleneck in my code.  
> I'm curious if anyone can recommend an improvement that would somehow be less 
> expensive and faster?
>
> Thank you
> Harold
>
>
> tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
>
> idList <- unique(tmp$id)
>
> ### Fast, but not fast enough
> system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))
>
> ### Not fast at all, a big bottleneck
> system.time(replicate(500, subset(tmp, id == idList[1])))
>
> ______________________________________________
> R-help@r-project.org<mailto:R-help@r-project.org> mailing list -- To 
> UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.