Hi: aggregate() is not well suited for summarization when the summary function takes multiple input arguments. Better choices for this type of summary are packages plyr, as David mentioned, and data.table.
Here's a toy example to illustrate. The fake data contain two grouping variables as factors and two continuous variables x and y.

    dd <- data.frame(gp1 = factor(rep(1:3, each = 10)),
                     gp2 = factor(rep(rep(c('A', 'B'), each = 5), 3)),
                     x = rnorm(30), y = rnorm(30))

Since I'm not terribly imaginative today (at least so far), the summary function I've created is a ratio estimator ybar/xbar, to be applied to all factor combinations. (I know I'm being lazy by not checking for division by zero or pairwise deletion in the case of missing values, but let's keep things simple.)

    myfun <- function(x, y) mean(y)/mean(x)

This is a function of two variables. aggregate() and certain other summary functions (e.g., summaryBy() in the doBy package) won't handle this function well, as they are better suited for functions that take one argument and output the values of one or more functions - e.g., a function of x that outputs the mean, standard deviation, sample size and IQR.

One approach is to use mapply(), and this would work after you got the data properly split by two factors into a list, but the authors of plyr and data.table have worked pretty hard to figure out how to simplify the process of summarizing data in various ways. Both can handle multiple argument functions quite easily and process them by one or more grouping variables.

To get groupwise summaries in ddply() from plyr, use

    library(plyr)
    ddply(dd, .(gp1, gp2), summarise, ratio = myfun(x, y))

      gp1 gp2      ratio
    1   1   A   1.129818
    2   1   B 241.458860
    3   2   A  -7.611629
    4   2   B  -0.842052
    5   3   A  -2.358191
    6   3   B   5.764439

The summarise argument (which is safer to use since there is a summarize function in package Hmisc, and they sometimes mask each other) is needed to get the ratio; without it, ddply() will just return the data without computing the function.
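For comparison, the base-R mapply() route mentioned above can be sketched as follows. This is only a rough illustration, not a recommendation over plyr or data.table: the set.seed() call and the object names g, xs, ys and ratio are my own additions for the sake of a reproducible example, and the numbers it produces will not match the output shown in this message, which came from unseeded random draws.

```r
# Toy data and summary function as defined earlier in this message.
# set.seed() is an addition here, only so the sketch is reproducible.
set.seed(1)
dd <- data.frame(gp1 = factor(rep(1:3, each = 10)),
                 gp2 = factor(rep(rep(c('A', 'B'), each = 5), 3)),
                 x = rnorm(30), y = rnorm(30))
myfun <- function(x, y) mean(y)/mean(x)

# Split x and y separately by the gp1-by-gp2 combinations.
# Passing a list of factors to split() groups by their interaction,
# so the pieces are named like "1.A", "2.A", ...
g  <- list(dd$gp1, dd$gp2)
xs <- split(dd$x, g)
ys <- split(dd$y, g)

# mapply() walks the two lists in parallel and hands matching
# pieces to the two-argument function.
ratio <- mapply(myfun, x = xs, y = ys)
ratio
```

The result is a named numeric vector with one ratio per factor combination. Turning it back into a data frame with separate gp1 and gp2 columns takes extra bookkeeping, which is the part that ddply() and data.table do for you.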
To do the same thing in data.table, we first load the package, convert the data frame to a data.table, and then summarize:

    library(data.table)
    dt <- data.table(dd)
    dt[, myfun(x, y), by = 'gp1, gp2']

         gp1 gp2         V1
    [1,]   1   A   1.129818
    [2,]   1   B 241.458860
    [3,]   2   A  -7.611629
    [4,]   2   B  -0.842052
    [5,]   3   A  -2.358191
    [6,]   3   B   5.764439

The comma after the left bracket is necessary. Running myfun() is an example of a 'J' operation (a body of code applied to columns). The first argument, before the comma, is the 'I' operation, which operates on rows (e.g., subsetting or SQL-like operations on the rows per se). Since we did nothing on the rows, that part is empty, just as when we select no rows in a data frame or matrix but do something on the columns. The by groups are given as a single comma-delimited string of column names. A safer approach is to use a data.table key, but we can get away without it in this example. See the package documentation for further (and clearer) details.

HTH,
Dennis

On Wed, Sep 1, 2010 at 3:47 PM, David Winsemius <dwinsem...@comcast.net> wrote:

> On Sep 1, 2010, at 6:29 PM, Adrian Ng wrote:
>
>> Dear R-Users,
>>
>> I have been using R for about 1 week and recently learned about the
>> Aggregate function, and from reading online it seems like it is comparable
>> to a SQL group by to perform summary functions. I am looking to do a
>> summary function, such as SUM/MIN, but instead of these basic functions, I
>> would like to calculate the IRR.
>>
>> Assuming I have a data set DS01:
>>
>> FirstName LastName  NCF  Date
>> A         B        -100  1/1/2001
>> A         B          50  2/1/2002
>> A         B         200  3/1/2003
>> A         C        -500  1/1/2001
>> A         C          50  2/1/2002
>> A         C          70  3/1/2003
>> A         C          50  2/1/2004
>> A         C          70  3/1/2005
>>
>> And an IRR function which takes in a cash flow and dates as inputs:
>> IRR(NCF,cfDate) and returns the IRR
>>
>> I tried the following:
>> aggregate(DS01$NCF,by=list(DS01$FirstName,DS01$LastName),RR,cfDate=DS01$Date)
>
> Since you are not forthcoming with the code for IRR (nor did you even spell
> it correctly in your aggregate call), I will simply suggest that you
> consider using the functions from the plyr package, or if you wanted to do it
> the old way you could look up split() and do:
>
> lapply(split(DS01, list(DS01$FirstName, DS01$LastName)),
>        function(df) IRR(df$NCF, cfDate = df$Date))
>
> Plyr method (perhaps):
>
> require(plyr)
> ddply(DS01, .(FirstName, LastName),
>       function(df) IRR(df$NCF, cfDate = df$Date))
>
> I'm quite sure that the data.table functionality could handle it quite
> easily as well.
>
>> and I got the following error:
>> Error in aggregate.data.frame(as.data.frame(x), ...) :
>>   arguments must have same length
>>
>> If anyone could shed some light on what may be causing this that would be
>> great.
>
> aggregate() takes a vector while you need to pass a more complex structure.
> I think you passed the entire length of Date.
>
>> More importantly, is this the right way to do this (should I even be using
>> the aggregate function)?
>>
>> Any help would be greatly appreciated!
>> Thanks!
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> David Winsemius, MD
> West Hartford, CT