Hi:

aggregate() is not well suited for summarization when the summary function
takes multiple input arguments. Better choices for this type of summary are
packages plyr, as David mentioned, and data.table.

Here's a toy example to illustrate. The fake data contain two grouping
variables as factors and two continuous variables x and y.

dd <- data.frame(gp1 = factor(rep(1:3, each = 10)),
                 gp2 = factor(rep(rep(c('A', 'B'), each = 5), 3)),
                 x = rnorm(30), y = rnorm(30))

Since I'm not terribly imaginative today (at least so far), the summary
function I've created is a ratio estimator ybar/xbar, to be applied to all
factor combinations. (I know I'm being lazy by not checking for division by
zero or pairwise deletion in the case of missing values, but let's keep
things simple.)

myfun <- function(x, y) mean(y)/mean(x)
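
For anyone who does want the checks skipped above, here is a minimal sketch of a more defensive version. The name myfun_safe and the choice to return NA in the degenerate cases are assumptions for illustration, not part of the original function.

```r
# Hypothetical variant of myfun(); the name myfun_safe is made up for this sketch.
myfun_safe <- function(x, y) {
  ok <- complete.cases(x, y)        # keep only pairs where both x and y are observed
  if (!any(ok)) return(NA_real_)    # no complete pairs left to summarise
  xbar <- mean(x[ok])
  if (xbar == 0) return(NA_real_)   # avoid dividing by a zero mean
  mean(y[ok]) / xbar
}

# Only the pairs (1, 2) and (2, 4) are complete here
myfun_safe(c(1, 2, NA, 3), c(2, 4, 6, NA))
```

Returning NA rather than stopping keeps a groupwise summary running when one group is degenerate; whether that is the right behaviour depends on the application.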

This is a function of two variables. aggregate() and certain other summary
functions (e.g., summaryBy() in the doBy package) won't handle this function
well, as they are better suited for functions that take one argument and
output the values of one or more functions - e.g., a function of x that
outputs the mean, standard deviation, sample size and IQR.
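
As a sketch of the one-argument case that aggregate() does handle well, the following applies a single function of x that returns several summaries. The helper name summ and the seed are assumptions for illustration, so the numbers will not match any output shown elsewhere in this message.

```r
set.seed(1)  # arbitrary seed chosen for this sketch
dd <- data.frame(gp1 = factor(rep(1:3, each = 10)),
                 gp2 = factor(rep(rep(c('A', 'B'), each = 5), 3)),
                 x = rnorm(30), y = rnorm(30))

# One input vector in, several summaries out (summ is a made-up name)
summ <- function(v) c(mean = mean(v), sd = sd(v), n = length(v), iqr = IQR(v))

# One row per gp1 x gp2 combination; the x column holds a matrix of summaries
res <- aggregate(x ~ gp1 + gp2, data = dd, FUN = summ)
res
```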

One approach is to use mapply(), and this would work after you got the data
properly split by two factors into a list, but the authors of plyr and
data.table have worked pretty hard to figure out how to simplify the process
of summarizing data in various ways. Both can handle multiple argument
functions quite easily and process them by one or more grouping variables.
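
For comparison, the base R route mentioned above might look roughly like this. It is only a sketch: the object names pieces, xs, ys, ratios and ratios2 are made up, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed chosen for this sketch
dd <- data.frame(gp1 = factor(rep(1:3, each = 10)),
                 gp2 = factor(rep(rep(c('A', 'B'), each = 5), 3)),
                 x = rnorm(30), y = rnorm(30))
myfun <- function(x, y) mean(y)/mean(x)

# Split the whole data frame: one piece per gp1.gp2 combination
pieces <- split(dd, list(dd$gp1, dd$gp2))
ratios <- sapply(pieces, function(d) myfun(d$x, d$y))

# The mapply() version: split x and y separately, then walk them in parallel
xs <- split(dd$x, list(dd$gp1, dd$gp2))
ys <- split(dd$y, list(dd$gp1, dd$gp2))
ratios2 <- mapply(myfun, xs, ys)

ratios
```

Both give a named vector with one ratio per factor combination; turning that back into a data frame with separate grouping columns takes a little extra work.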

To get groupwise summaries in ddply() from plyr, use

library(plyr)
ddply(dd, .(gp1, gp2), summarise, ratio = myfun(x, y))
  gp1 gp2      ratio
1   1   A   1.129818
2   1   B 241.458860
3   2   A  -7.611629
4   2   B  -0.842052
5   3   A  -2.358191
6   3   B   5.764439

The summarise argument is needed to get the ratio; without it, ddply() will
just return the data without computing the function. The spelling summarise
is the safer one to use, since package Hmisc also has a function named
summarize and the two sometimes mask each other.

To do the same thing in data.table, we first load the package, convert the
data frame to a data.table, and then summarize:

library(data.table)
dt <- data.table(dd)
dt[, myfun(x, y), by = 'gp1, gp2']
     gp1 gp2         V1
[1,]   1   A   1.129818
[2,]   1   B 241.458860
[3,]   2   A  -7.611629
[4,]   2   B  -0.842052
[5,]   3   A  -2.358191
[6,]   3   B   5.764439

The comma after the left bracket is necessary. The first argument inside the
brackets is the 'i' operation, which works on rows (e.g., subsetting or
SQL-like operations on the rows per se); the second is the 'j' operation, a
body of code applied to columns, and running myfun() is an example of one.
Since we did nothing on the rows, the first part is empty, just as when we
select no rows in a data frame or matrix but do something on the columns.
The by groups are comma-delimited within quotes because they are factors. A
safer approach is to use a data.table key, but we can get away without it in
this example. See the package documentation for further (and clearer)
details.
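
A sketch of the keyed approach, written against a recent data.table release rather than the version used above, so the syntax is an assumption relative to this message: .() in the j and by slots, and the result column name ratio, are choices made for the example.

```r
library(data.table)

set.seed(1)  # arbitrary seed chosen for this sketch
dd <- data.frame(gp1 = factor(rep(1:3, each = 10)),
                 gp2 = factor(rep(rep(c('A', 'B'), each = 5), 3)),
                 x = rnorm(30), y = rnorm(30))
myfun <- function(x, y) mean(y)/mean(x)

dt <- as.data.table(dd)
setkey(dt, gp1, gp2)   # sort by gp1, gp2 and mark them as the key

# Empty i (all rows), j computes the ratio, by gives the grouping columns;
# naming the result avoids the default column name V1
res <- dt[, .(ratio = myfun(x, y)), by = .(gp1, gp2)]
res
```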

HTH,
Dennis



On Wed, Sep 1, 2010 at 3:47 PM, David Winsemius <dwinsem...@comcast.net> wrote:

>
> On Sep 1, 2010, at 6:29 PM, Adrian Ng wrote:
>
>> Dear R-Users,
>>
>> I have been using R for about 1 week and recently learned about the
>> Aggregate function, and from reading online it seems like it is comparable
>> to a SQL group by to perform summary functions.  I am looking to do a
>> summary function, such as SUM/MIN, but instead of these basic functions, I
>> would like to calculate the IRR.
>>
>> Assuming I have a data set DS01:
>> FirstName  LastName  NCF   Date
>> A          B         -100  1/1/2001
>> A          B           50  2/1/2002
>> A          B          200  3/1/2003
>> A          C         -500  1/1/2001
>> A          C           50  2/1/2002
>> A          C           70  3/1/2003
>> A          C           50  2/1/2004
>> A          C           70  3/1/2005
>>
>> And an IRR function which takes in a cash flow and dates as inputs:
>> IRR(NCF,cfDate) and returns the IRR
>>
>> I tried the following:
>> aggregate(DS01$NCF,by=list(DS01$ FirstName,DS01$
>> LastName),RR,cfDate=DS01$Date)
>>
>
> Since you are not forthcoming with the code for IRR (nor did you even spell
> it correctly in your aggregate call), I will simply suggest that you
> consider using the functions from the plyr package or, if you wanted to do it
> the old way, you could look up split() and do:
>
> lapply(split(DS01, list(DS01$FirstName, DS01$LastName)),
>        function(df) IRR(df$NCF, cfDate = df$Date))
>
> Plyr method (perhaps):
>
> require(plyr)
> ddply(DS01, .(FirstName, LastName), function(df) IRR(df$NCF, cfDate = df$Date))
>
> I'm quite sure that the data.table functionality could handle it quite
> easily as well.
>
>
>
>
>> and I got the following error:
>> Error in aggregate.data.frame(as.data.frame(x), ...) :
>>  arguments must have same length
>>
>> If anyone could shed some light on what may be causing this that would be
>> great.
>>
>
> aggregate() takes a vector while you need to pass a more complex structure.
> I think you passed the entire length of Date.
>
>
>
>> More importantly, is this the right way to do this (should I even be using
>> the aggregate function)?
>>
>> Any help would be greatly appreciated!
>> Thanks!
>>
>>
>>        [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
> David Winsemius, MD
> West Hartford, CT