On 9/30/2005 1:41 PM, Peter Dalgaard wrote:
> Duncan Murdoch <[EMAIL PROTECTED]> writes:
>
>> I want to calculate a statistic on a number of subgroups of a dataframe,
>> then put the results into a dataframe.  (What SAS PROC MEANS does, I
>> think, though it's been years since I used it.)
>>
>> This is possible using by(), but it seems cumbersome and fragile.  Is
>> there a more straightforward way than this?
>>
>> Here's a simple example showing my current strategy:
>>
>>  > dataset <- data.frame(gp1 = rep(1:2, c(4,4)), gp2 = rep(1:4,
>> c(2,2,2,2)), value = rnorm(8))
>>  > dataset
>>    gp1 gp2      value
>> 1   1   1  0.9493232
>> 2   1   1 -0.0474712
>> 3   1   2 -0.6808021
>> 4   1   2  1.9894999
>> 5   2   3  2.0154786
>> 6   2   3  0.4333056
>> 7   2   4 -0.4746228
>> 8   2   4  0.6017522
>>  >
>>  > handleonegroup <- function(subset) data.frame(gp1 = subset$gp1[1],
>> + gp2 = subset$gp2[1], statistic = mean(subset$value))
>>  >
>>  > bylist <- by(dataset, list(dataset$gp1, dataset$gp2), handleonegroup)
>>  >
>>  > result <- do.call('rbind', bylist)
>>  > result
>>     gp1 gp2  statistic
>> 1    1   1 0.45092598
>> 11   1   2 0.65434890
>> 12   2   3 1.22439210
>> 13   2   4 0.06356469
>>
>> tapply() is inappropriate because I don't have all possible combinations
>> of gp1 and gp2 values, only some of them:
>>
>>  > tapply(dataset$value, list(dataset$gp1, dataset$gp2), mean)
>>           1         2        3          4
>> 1 0.450926 0.6543489       NA         NA
>> 2       NA        NA 1.224392 0.06356469
>>
>> In the real case, I only have a very sparse subset of all the
>> combinations, and tapply() and by() both die for lack of memory.
>>
>> Any suggestions on how to do what I want, without using SAS?
>
> Have you tried aggregate()?

aggregate() has a few problems:

  - It applies the function to every column in the dataframe.  In my
    case it only makes sense to apply it to some of them.  (This may not
    be a killer, but it certainly makes things inefficient and tricky.)

  - I'd like to look at the whole subset to figure out the function
    (but I can probably work around this).

  - It uses too much memory.  E.g. try

 > df <- data.frame(x=rnorm(1000), y=rnorm(1000), z=rnorm(1000),
+ w=rnorm(1000))
 > aggregate(df, list(df$x,df$y,df$z), mean)
Error: cannot allocate vector of size 3906250 Kb
In addition: Warning messages:
1: Reached total allocation of 1007Mb: see help(memory.size)
2: Reached total allocation of 1007Mb: see help(memory.size)

This should have returned the same dataframe (there are 1000 subsets),
but it tried to construct a billion (1000^3) of them.

On 9/30/2005 1:48 PM, Don MacQueen wrote:
> Look at the summarize() function in the Hmisc package.

It seems to want a matrix, not a data.frame.  The real situation has
mixed types (character, factors, numeric) so it can't be a matrix.

> (and this is an r-help question, not an r-devel question, I would think)

Yes, that's where I should have posted.  Sorry.  However, this is
starting to look like a development problem...

Peter again:

> Alternatively, you might split on interaction(...., drop=TRUE)

Looking at the code, it appears that will construct the full product
interaction, then subset to the non-empty cases...

Yes, it does that.

Looks like I'll have to write my own.

Duncan

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
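
For what it's worth, here is a minimal sketch of what a hand-rolled
version might look like.  It is not code from this thread: the name
sparse_by() and its arguments are hypothetical.  The idea is to paste the
grouping columns into one string key per row and split the row indices on
the keys that actually occur, so the full product of factor levels is
never built.  It assumes the grouping columns are atomic vectors and that
distinct values have distinct string forms (pasting numeric keys keeps
only about 15 significant digits).

```r
# Hypothetical helper: apply FUN to each subset of `data` defined by the
# *observed* combinations of the columns named in `groups`.
sparse_by <- function(data, groups, FUN) {
  # One string key per row; "\r" is an assumed separator that should not
  # appear inside the key values themselves.
  key <- do.call(paste, c(unname(as.list(data[groups])), sep = "\r"))
  # Split row indices by key.  The factor has one level per observed key,
  # in order of first appearance, so no empty cells are created.
  idx <- split(seq_len(nrow(data)), factor(key, levels = unique(key)))
  # FUN sees the whole subset (all columns) and returns a data frame.
  pieces <- lapply(idx, function(i) FUN(data[i, , drop = FALSE]))
  result <- do.call(rbind, pieces)
  rownames(result) <- NULL
  result
}

# Same example as in the original question.
dataset <- data.frame(gp1 = rep(1:2, c(4, 4)),
                      gp2 = rep(1:4, c(2, 2, 2, 2)),
                      value = rnorm(8))
handleonegroup <- function(subset) data.frame(gp1 = subset$gp1[1],
    gp2 = subset$gp2[1], statistic = mean(subset$value))
result <- sparse_by(dataset, c("gp1", "gp2"), handleonegroup)
```

Memory use here grows with the number of rows and observed groups rather
than with the product of the numbers of levels, though the repeated
data.frame() and rbind() calls may still be slow when there are very many
groups.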