On Fri, Feb 4, 2011 at 6:54 PM, Ista Zahn <iz...@psych.rochester.edu> wrote:
> However, I don't think you've told us what you're actually trying to
> accomplish...

I'm trying to aggregate the y value of a big data set which has several x's and a y.

I'm using an abstracted example for many reasons, partly to comply with the posting guidelines on providing a reproducible example. What I'm really aggregating is some incredibly boring and complex customer data for an undisclosed client.

As it turns out, aggregate() will not work when some of the x's are NA, unless you convert them to factors with NA included as a level. In my case, the data is so big that doing the conversions causes other memory problems, and it renders some of my numeric values useless for other calculations.

My real data looks more like this (except with a few more categories and records):

set.seed(100)
library(plyr)
dat <- data.frame(
    x1 = sample(c(NA, 'm', 'f'), 2e6, replace = TRUE),
    x2 = sample(c(NA, 1:10), 2e6, replace = TRUE),
    x3 = sample(c(NA, letters[1:5]), 2e6, replace = TRUE),
    x4 = sample(c(NA, TRUE, FALSE), 2e6, replace = TRUE),
    x5 = sample(c(NA, 'active', 'inactive', 'deleted', 'resumed'), 2e6, replace = TRUE),
    x6 = sample(c(NA, 1:10), 2e6, replace = TRUE),
    x7 = sample(c(NA, 'married', 'divorced', 'separated', 'single', 'etc'), 2e6, replace = TRUE),
    x8 = sample(c(NA, TRUE, FALSE), 2e6, replace = TRUE),
    y = trunc(rnorm(2e6) * 10000),
    stringsAsFactors = FALSE)
str(dat)

## The control total
sum(dat$y, na.rm = TRUE)

## The aggregate total
sum(aggregate(dat$y, dat[, 1:8], sum, na.rm = TRUE)$x)

## The ddply total
sum(ddply(dat, .(x1, x2, x3, x4, x5, x6, x7, x8), function(x) {
    data.frame(y.sum = sum(x$y, na.rm = TRUE))
})$y.sum)

ddply worked a little better than I expected at first, but it slows to a crawl or runs out of memory too often for me to invest the time in learning how to use it. Just now it worked for 1m records, and it was just a bit slower than aggregate. But for the 2m example it hasn't finished calculating.
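For what it's worth, here is a minimal sketch of one possible workaround in base R, shown on a cut-down version of the example above (two grouping columns and 1e4 rows, both chosen only for illustration). It builds a single character key per row with paste(), which turns NA into the string "NA", and then sums by that key with rowsum(). It assumes that a literal "NA" string and the "|" separator never occur as real values in the grouping columns, and I have not checked how it behaves in time or memory at the 2m-row scale.

```r
set.seed(100)
n <- 1e4  # illustrative size only; the example above uses 2e6

dat <- data.frame(
    x1 = sample(c(NA, "m", "f"), n, replace = TRUE),
    x2 = sample(c(NA, 1:10), n, replace = TRUE),
    y  = trunc(rnorm(n) * 10000),
    stringsAsFactors = FALSE)

## The control total over all rows
control <- sum(dat$y, na.rm = TRUE)

## aggregate() drops every row that has an NA in a grouping column
agg_total <- sum(aggregate(dat$y, dat[, 1:2], sum, na.rm = TRUE)$x)

## One key per row; paste() writes NA as "NA", so those rows keep a group.
## Assumption: "NA" and "|" never appear as genuine values in x1 or x2.
key <- paste(dat$x1, dat$x2, sep = "|")

## rowsum() returns a one-column matrix of per-key sums
sums <- rowsum(dat$y, key, na.rm = TRUE)
key_total <- sum(sums)

## Compare the three totals; key_total should match control
print(c(control = control, aggregate = agg_total, keyed = key_total))
```

The grouping columns themselves are left untouched, so numeric x's stay numeric for other calculations; the cost is one extra character vector the length of the data.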
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.