On Tue, Oct 12, 2010 at 1:40 PM, Bond, Stephen <stephen.b...@cibc.com> wrote:
> Hello everybody,
>
> Data is
> myd <- data.frame(id1=rep(c("a","b","c"),each=3),id2=rep(1:3,3),val=rnorm(9))
>
> I want to get a cumulative sum over each of id1. trying aggregate does not
> work
>
> myd$pcum <- aggregate(myd[,c("val")],list(orig=myd$id1),cumsum)
>
> Please suggest a solution. In real the dataframe is huge so looping with for
> and subsetting is not a great idea (still doable, though).
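
For the small example in the question, a minimal sketch of the per-group cumulative sum might look like the following. It assumes the goal is a new column pcum holding the running total of val within each id1; the set.seed call is my addition so the sketch is reproducible and is not part of the original data. ave returns a vector the same length as its input, whereas aggregate returns one row per group, which is why ave fits a column assignment here.

```r
# Small data frame in the same shape as the one in the question
set.seed(1)  # assumption: added only so the sketch is reproducible
myd <- data.frame(id1 = rep(c("a", "b", "c"), each = 3),
                  id2 = rep(1:3, 3),
                  val = rnorm(9))

# ave() applies FUN within each level of id1 and returns a vector
# aligned with the original rows, so it can be assigned as a column
myd$pcum <- ave(myd$val, myd$id1, FUN = cumsum)

print(myd)
```

The first row of each id1 group should then have pcum equal to val, and the running total restarts whenever id1 changes.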
Looping can be slow, but it is not necessarily so. Here are three
approaches that use ave with cumsum to solve this problem. In the
benchmark the loop is actually the fastest, although only about 5%
ahead of nonloop2:

# test data: k groups of N rows each
N <- 1e4
k <- 10
myd <- data.frame(id1=rep(letters[1:k],each=N),id2=rep(1:k,N),val=rnorm(k*N))

library(rbenchmark)
benchmark(order = "relative", replications = 100,
  # for loop over columns 2 and 3, grouped by column 1
  loop = {
    loop <- myd
    for(i in 2:3) loop[, i] <- ave(myd[, i], myd[, 1], FUN = cumsum)
  },
  # transform with one ave call per column
  nonloop1 = {
    nonloop1 <- transform(myd,
      id2 = ave(id2, id1, FUN = cumsum),
      val = ave(val, id1, FUN = cumsum))
  },
  # lapply over the column indexes, then replace both columns at once
  nonloop2 = {
    f <- function(i) ave(myd[, i], myd[, 1], FUN = cumsum)
    nonloop2 <- replace(myd, 2:3, lapply(2:3, f))
  }
)

# check that all three approaches give the same result
identical(loop, nonloop1)
identical(loop, nonloop2)

The output on my laptop is:

      test replications elapsed relative user.self sys.self user.child sys.child
1     loop          100    8.52 1.000000      8.07     0.10         NA        NA
3 nonloop2          100    8.94 1.049296      8.29     0.17         NA        NA
2 nonloop1          100   11.65 1.367371     10.71     0.22         NA        NA

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com