On Tue, Oct 12, 2010 at 1:40 PM, Bond, Stephen <stephen.b...@cibc.com> wrote:
> Hello everybody,
>
> Data is
> myd <- data.frame(id1=rep(c("a","b","c"),each=3),id2=rep(1:3,3),val=rnorm(9))
>
> I want to get a cumulative sum over each of id1. trying aggregate does not 
> work
>
> myd$pcum <- aggregate(myd[,c("val")],list(orig=myd$id1),cumsum)
>
> Please suggest a solution. In real the dataframe is huge so looping with for 
> and subsetting is not a great idea (still doable, though).

Looping can be slow, but it's not necessarily so.  Here are three
approaches that use ave with cumsum to solve this problem; each one
cumulates both id2 and val within id1.  In the benchmark the loop is
actually the fastest, though only marginally ahead of nonloop2:

N <- 1e4  # rows per group
k <- 10   # number of groups
myd <- data.frame(id1=rep(letters[1:k],each=N),id2=rep(1:k,N),val=rnorm(k*N))
library(rbenchmark)  # provides benchmark()

benchmark(order = "relative", replications = 100,
  loop = { loop <- myd
    for(i in 2:3) loop[, i] <- ave(myd[, i], myd[, 1], FUN = cumsum)
  },
  nonloop1 = { nonloop1 <- transform(myd,
    id2 = ave(id2, id1, FUN = cumsum),
    val = ave(val, id1, FUN = cumsum)
  )},
  nonloop2 = {
    f <- function(i) ave(myd[, i], myd[, 1], FUN = cumsum)
    nonloop2 <- replace(myd, 2:3, lapply(2:3, f))
  }
)

identical(loop, nonloop1)
identical(loop, nonloop2)

The output on my laptop is:

      test replications elapsed relative user.self sys.self user.child sys.child
1     loop          100    8.52 1.000000      8.07     0.10         NA        NA
3 nonloop2          100    8.94 1.049296      8.29     0.17         NA        NA
2 nonloop1          100   11.65 1.367371     10.71     0.22         NA        NA
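
For the single column in the original question, the same ave/cumsum idea reduces to one line.  The following is a minimal sketch on the 9-row example data, not a tuned solution: the set.seed call and the `check` vector are my additions for illustration only.  As far as I can tell, the aggregate attempt fails because aggregate is meant to produce one summary per group, so its result does not line up row for row with myd, whereas ave returns a vector the same length as its input.

```r
# Example data in the shape given in the question.
set.seed(1)  # assumption: added only so the example is repeatable
myd <- data.frame(id1 = rep(c("a", "b", "c"), each = 3),
                  id2 = rep(1:3, 3),
                  val = rnorm(9))

# ave() applies FUN within each level of id1 and returns a vector
# aligned with the original rows, so it can be assigned as a column.
myd$pcum <- ave(myd$val, myd$id1, FUN = cumsum)

# Cross-check against an explicit per-group loop ('check' is a
# hypothetical helper vector, not part of the original post).
check <- numeric(nrow(myd))
for (g in unique(myd$id1)) {
  rows <- myd$id1 == g
  check[rows] <- cumsum(myd$val[rows])
}
stopifnot(isTRUE(all.equal(myd$pcum, check)))
```

Because ave keeps the original row order, the data do not need to be sorted by id1 first.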

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

