Re: [R] meaning of formula in aggregate function

P Ehlers Sun, 23 Jan 2011 05:39:49 -0800

Den wrote:

Dear Dennis
Thank you very much for your comprehensive reply and for the time you've
spent dealing with my e-mail.
Your kind explanation made things clearer for me. After your explanation it
looks simple.
lapply with the chosen options takes the small part of cycle<n> with the same
id (e.g. df[df$id==3,"cycle2"]) and makes from it just a bunch of
characters.
The only thing I still don't get is how this code gets rid of the
NAs, but this is a rather minor technical issue. My main question was
about the formula. You helped me indeed.


Okay, now I see what you're asking regarding the NAs.
I should have realized it before. Anyway, the answer
is in the function sort(). Have a look at its help
page and note what sort does when 'na.last=NA', the
default. You'll see where the NAs went.
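
A minimal sketch of that behaviour, using a made-up vector x (not taken from the original data): with the default na.last = NA, sort() drops missing values, so paste() never sees them. With na.last = TRUE the NA survives and paste() turns it into the string "NA".

```r
# Hypothetical example vector, for illustration only
x <- c("m", NA, "f", "c")

# Default na.last = NA: missing values are removed before sorting
sorted_default <- sort(x)

# na.last = TRUE keeps the NA and places it last
sorted_keep <- sort(x, na.last = TRUE)

# paste() only works on whatever sort() hands over
pasted_default <- paste(sorted_default, collapse = "")
pasted_keep    <- paste(sorted_keep, collapse = "")

print(pasted_default)
print(pasted_keep)
```

The first string contains only the three letters; the second also carries the literal text "NA" at the end.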

Peter Ehlers

Thank you again
Have a nice day
Denis

From bending but not broken Belarus

On Sat, 22/01/2011 at 17:55 -0800, Dennis Murphy wrote:

Hi:

I wouldn't pretend to speak for Henrique, but I'll give it a shot.

On Sat, Jan 22, 2011 at 4:44 AM, Den <d.kazakiew...@gmail.com> wrote:
        Dear R community
        Recently, dear Henrique Dallazuanna literally saved me solving one
        problem on data transformation which follows:

(n_, _n, j_, k_ signify numbers)

        SOURCE DATA:

        id      cycle1  cycle2  cycle3  …       cycle_n
        1       c       c       c               c
        1       m       m       m               m
        1       f       f       f               f
        2       m       m       m               NA
        2       f       f       f               NA
        2       c       c       c               NA
        3       a       a       NA              NA
        3       c       c       c               NA
        3       f       f       f               NA
        3       NA      NA      m               NA
        ...........................................

Q: How to transform source data to:

        RESULT DATA:
        id      cyc1    cyc2    cyc3    …       cyc_n
        1       cfm     cfm     cfm             cfm
        2       cfm     cfm     cfm
        3       acf     acf     cfm
        ...........................................

        Henrique's solution is:

        aggregate(. ~ id, lapply(df, as.character),
                  FUN = function(x) paste(sort(x), collapse = ''),
                  na.action = na.pass)

The first part, . ~ id, is the formula. It's using every available
variable in the input data on the left hand side of the formula except
for id, which is the grouping variable.

The data object is lapply(df, as.character), which is a list object
that translates every element to character. I'm guessing that each
element of the list is a character string or list of character
strings, but I'm not sure. It looks like the individual characters of
each cycle comprise a list component within id. (??)  [My guess: the
result of lapply() is a list of lists. The top-level list components
correspond to the id's, while the second-level components are the
cycle variables, whose elements are the characters in each cycle
variable for each row with the same id.]
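
The structure can be checked directly. The sketch below uses a small made-up data frame (the name df and its values are assumptions, loosely modelled on the source data above) and suggests that lapply(df, as.character) returns a flat list with one character vector per column, not a list nested by id. The splitting by id appears to happen later, inside aggregate().

```r
# Hypothetical data, loosely following the SOURCE DATA layout
df <- data.frame(
  id     = c(1, 1, 1, 3, 3, 3, 3),
  cycle1 = c("c", "m", "f", "a", "c", "f", NA),
  cycle3 = c("c", "m", "f", NA, "c", "f", "m"),
  stringsAsFactors = TRUE   # factors, as was the default in R at the time
)

# One list component per column, each converted to a character vector
chr <- lapply(df, as.character)
str(chr)
```

Converting to character first also matters when the columns are factors: combined with cbind(), factors would be reduced to their integer codes.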

The function to be applied to each id is described in FUN. As Peter
mentioned, it's an 'anonymous' function, which means it is defined
in-line. In this case, the elements of a generic input object x are
sorted in increasing order and then combined into a single string
(the purpose of collapse = ); NA values are skipped
over. Thus, if my hypothesis about the structure of the list is
correct, the three characters in each cycle/id combination are first
sorted and then combined into a single string, which is then output as
the result. By the way that Henrique used the formula, the aggregate()
function will march through each cycle variable within id and execute
the function, and then iterate the process over all id's.

        Could somebody EXPLAIN HOW IT WORKS?

        I mean Henrique saved my investigation indeed.
        However, considering the fact that I am about to perform an
        investigation of cancer chemotherapy in 500 patients, it would be
        nice to know what I am actually doing.

Henrique's R knowledge is on a different level from most of us, so I
understand your question :)

        1. All help says about LHS in formulas like '.~id' is that its
        name is "dot notation". And not a single word more. Thus, I have no
        clue what the dot in that formula really means.

. is shorthand for 'everything not otherwise specified in the model
formula'. In this case, it represents the entire set of cycle
variables.
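
One way to see this, sketched with a made-up two-cycle data frame (df and the helper name glue are assumptions for illustration): writing the left-hand side out with cbind() should give the same groups and strings as the dot.

```r
# Hypothetical data with two cycle columns
df <- data.frame(
  id     = c(1, 1, 2, 2),
  cycle1 = c("m", "c", "f", "a"),
  cycle2 = c("f", "c", "m", "a"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: sort the letters, then glue them into one string
glue <- function(x) paste(sort(x), collapse = "")

# Dot on the left-hand side: every column except id
with_dot <- aggregate(. ~ id, data = df, FUN = glue)

# The same request with the cycle columns spelled out
spelled <- aggregate(cbind(cycle1, cycle2) ~ id, data = df, FUN = glue)

print(with_dot)
print(spelled)
```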

        2. help says:
         Note that ‘paste()’ coerces ‘NA_character_’, the character missing
         value, to ‘"NA"’
        And at the same time:
         ‘na.pass’ returns the object unchanged.
        I am happy that I don't have NAs in my data. I just don't understand
        how it happened.
        3. Can't see the real difference between 'FUN = function(x) paste(x)'
        and 'FUN = paste'. However, the former works perfectly while the
        latter simply does not.

        All I can follow from the code above is that R breaks the data into
        groups with the same id, then it tears each little 'cycle' piece into
        separate characters, then sorts them and puts these characters
        together within the same id on each 'cycle'. I miss how R puts all
        this mess back together into a nice data frame of long format. NAs
        are also a question, as I said before.

By default, aggregate() will try to return a data frame. For each id,
it will output the id and the result of the function applied to each
cycle variable, so there should be one row for each id, and n + 1
columns for the n cycle variables + id.
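
As a rough check of that shape, here is Henrique's call run on a made-up data frame that mirrors the first three cycle columns of the source data (the object names df and res are assumptions). Note that the formula method of aggregate() uses na.action = na.omit by default, which would drop whole rows containing an NA; na.pass keeps them, and sort() then discards the NAs column by column.

```r
# Hypothetical reconstruction of the SOURCE DATA for ids 1 to 3
df <- data.frame(
  id     = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  cycle1 = c("c", "m", "f", "m", "f", "c", "a", "c", "f", NA),
  cycle2 = c("c", "m", "f", "m", "f", "c", "a", "c", "f", NA),
  cycle3 = c("c", "m", "f", "m", "f", "c", NA, "c", "f", "m"),
  stringsAsFactors = FALSE
)

# Henrique's solution: one sorted, collapsed string per id and cycle
res <- aggregate(. ~ id, lapply(df, as.character),
                 FUN = function(x) paste(sort(x), collapse = ""),
                 na.action = na.pass)

print(res)
dim(res)   # one row per id, one column per cycle variable plus id
```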

Does that help?

Cheers,

Dennis

        Could you please shed some light on it, if you don't mind
        answering those naive questions.

______________________________________________

        R-help@r-project.org mailing list
        https://stat.ethz.ch/mailman/listinfo/r-help
        PLEASE do read the posting guide
        http://www.R-project.org/posting-guide.html
        and provide commented, minimal, self-contained, reproducible
        code.

