On Fri, Feb 4, 2011 at 6:54 PM, Ista Zahn <iz...@psych.rochester.edu> wrote:

> >
> > However, I don't think you've told us what you're actually trying to
> > accomplish...
> >
>

I'm trying to aggregate the y values of a big data set which has several x's
and a y.
I'm using an abstracted example for several reasons, partly to comply with
the posting guideline about providing a reproducible example.  What I'm
really aggregating is some incredibly boring and complex customer data for
an undisclosed client.

As it turns out, aggregate() silently drops every row in which any of the
x's is NA, unless you first convert the x's to factors with NA included as a
level.
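
A toy illustration of that behaviour, in case it helps anyone following
along.  This is only a sketch: the data frame `toy` and its columns are made
up for the example, and addNA() is just one way of getting NA in as a factor
level.

```r
# Toy data: one grouping column containing an NA, one value column.
toy <- data.frame(g = c("a", "b", NA, "a"), y = c(1, 2, 4, 8),
                  stringsAsFactors = FALSE)

# aggregate() drops the row whose grouping value is NA.
plain <- aggregate(toy$y, list(g = toy$g), sum)
sum(plain$x)   # smaller than sum(toy$y)

# addNA() makes NA an explicit factor level, so that row is kept.
kept <- aggregate(toy$y, list(g = addNA(toy$g)), sum)
sum(kept$x)    # matches sum(toy$y)
```

On the real data the conversion would have to be applied to all eight x
columns, which is where the memory cost mentioned below comes in.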

In my case, the data is so big that doing the conversions causes other
memory problems, and renders some of my numeric values useless for other
calculations.

My real data looks more like this (except with a few more categories and
records):

set.seed(100)
library(plyr)
dat <- data.frame(
        x1=sample(c(NA,'m','f'), 2e6, replace=TRUE),
        x2=sample(c(NA, 1:10), 2e6, replace=TRUE),
        x3=sample(c(NA,letters[1:5]), 2e6, replace=TRUE),
        x4=sample(c(NA,TRUE,FALSE), 2e6, replace=TRUE),
        x5=sample(c(NA,'active','inactive','deleted','resumed'), 2e6,
                  replace=TRUE),
        x6=sample(c(NA, 1:10), 2e6, replace=TRUE),
        x7=sample(c(NA,'married','divorced','separated','single','etc'),
                  2e6, replace=TRUE),
        x8=sample(c(NA,TRUE,FALSE), 2e6, replace=TRUE),
        y=trunc(rnorm(2e6)*10000), stringsAsFactors=FALSE)
str(dat)
## The control total
sum(dat$y, na.rm=TRUE)
## The aggregate total (rows with an NA in any x are dropped)
sum(aggregate(dat$y, dat[,1:8], sum, na.rm=TRUE)$x)
## The ddply total
sum(ddply(dat, .(x1,x2,x3,x4,x5,x6,x7,x8),
          function(x) data.frame(y.sum=sum(x$y, na.rm=TRUE)))$y.sum)

ddply worked a little better than I expected at first, but it slows to a
crawl or runs out of memory too often for me to invest the time in learning
how to use it.  Just now it worked for 1m records, and it was just a bit
slower than aggregate.  But for the 2m example it hasn't finished
calculating.
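
One base-R alternative that might be worth trying is to paste the x's into a
single key and sum with rowsum().  This is a different technique from both
aggregate() and ddply(), sketched here on a small made-up data frame
(`small`, `key` and `sums` are names invented for the example); I have not
timed it on the 2m-row data, and the pasted key is itself a large character
vector, so it may or may not fit in memory.

```r
# Small stand-in for the real data: two x columns with NAs and a y.
set.seed(1)
n <- 1000
small <- data.frame(
        x1 = sample(c(NA, "m", "f"), n, replace = TRUE),
        x2 = sample(c(NA, 1:3), n, replace = TRUE),
        y  = trunc(rnorm(n) * 10000),
        stringsAsFactors = FALSE)

# paste() turns NA into the string "NA", so no group key is missing.
# Caveat: a genuine value "NA" would be merged with the missing ones.
key <- do.call(paste, c(small[c("x1", "x2")], sep = "|"))

# rowsum() returns a one-column matrix with one row per distinct key.
sums <- rowsum(small$y, key)

# The grouped total should agree with the control total.
sum(sums) == sum(small$y)
```

The original x columns are left untouched, so the numeric ones stay usable
for other calculations.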


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
