Hi:
I wouldn't pretend to speak for Henrique, but I'll give it a shot.
On Sat, Jan 22, 2011 at 4:44 AM, Den <d.kazakiew...@gmail.com> wrote:
Dear R community
Recently, dear Henrique Dallazuanna literally saved me by solving a
data transformation problem, which follows:
(n_, _n, j_, k_ signify numbers)
SOURCE DATA:
id  cycle1  cycle2  cycle3  …  cycle_n
1   c       c       c          c
1   m       m       m          m
1   f       f       f          f
2   m       m       m          NA
2   f       f       f          NA
2   c       c       c          NA
3   a       a       NA         NA
3   c       c       c          NA
3   f       f       f          NA
3   NA      NA      m          NA
...........................................
Q: How to transform source data to:
RESULT DATA:
id  cyc1  cyc2  cyc3  …  cyc_n
1   cfm   cfm   cfm      cfm
2   cfm   cfm   cfm
3   acf   acf   cfm
...........................................
Henrique's solution is:
aggregate(. ~ id, lapply(df, as.character),
          FUN = function(x) paste(sort(x), collapse = ''),
          na.action = na.pass)
The first part, . ~ id, is the formula. The dot on the left hand side
stands for every variable in the input data except id, and id, on the
right hand side, is the grouping variable.
The data object is lapply(df, as.character). Because a data frame is a
list of columns, lapply() walks over the columns and returns a plain
list with one character vector per variable (id, cycle1, ...,
cycle_n), each as long as the data frame has rows. It is not a list of
lists, and nothing is grouped by id at this stage; the grouping is
done later by aggregate() itself.
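To check the structure of the data object, here is a minimal sketch. The data frame df below is a hypothetical stand-in built from the first three cycle columns of the sample data in the question (the elided cycle_n column is left out), and chr is just an illustrative name.

```r
# Hypothetical stand-in for the poster's data: first three cycles only.
df <- data.frame(
  id     = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  cycle1 = c("c", "m", "f", "m", "f", "c", "a", "c", "f", NA),
  cycle2 = c("c", "m", "f", "m", "f", "c", "a", "c", "f", NA),
  cycle3 = c("c", "m", "f", "m", "f", "c", NA,  "c", "f", "m"),
  stringsAsFactors = FALSE
)

# lapply() iterates over the columns of a data frame.
chr <- lapply(df, as.character)

str(chr)                   # a list with one component per column
sapply(chr, is.character)  # every component is a character vector
sapply(chr, length)        # each one has nrow(df) elements
```

Nothing in this list is organised by id yet; it is simply the original columns, converted to character.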
The function to be applied within each id is given in FUN. As Peter
mentioned, it's an 'anonymous' function, which means it is defined
in-line. It takes a generic input vector x, sorts its elements in
increasing order, and combines them into a single string (the purpose
of collapse = ''). sort() drops NA values by default, so they never
reach paste(). Thus, for each cycle/id combination, the characters are
first sorted and then combined into a single string, which is output
as the result. Given the way Henrique wrote the formula, aggregate()
will march through each cycle variable within an id, execute the
function, and then repeat the process over all ids.
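To see what the anonymous function does to a single group, it can be tried on a plain vector. This is only a sketch: collapse_sorted is a hypothetical name for the same function body, and x is made-up input.

```r
# The same body as the anonymous function, given a name for testing.
collapse_sorted <- function(x) paste(sort(x), collapse = "")

x <- c("m", "c", NA, "f")

sort(x)                  # "c" "f" "m"  (sort() drops the NA by default)
collapse_sorted(x)       # "cfm"
paste(x, collapse = "")  # "mcNAf"  (without sort(), NA becomes "NA")

# A group that is entirely NA collapses to an empty string.
collapse_sorted(c(NA_character_, NA_character_))  # ""
```

The last call would account for the blank cells in the result table, where a cycle has only NA values for a given id.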
Could somebody EXPLAIN HOW IT WORKS?
I mean, Henrique saved my investigation indeed. However, considering
the fact that I am about to perform an investigation of cancer
chemotherapy in 500 patients, it would be nice to know what I am
actually doing.
Henrique's R knowledge is on a different level from most of us, so I
understand your question :)
1. All the help says about the LHS in formulas like '.~id' is that
its name is "dot notation". And not a single word more. Thus, I have
no clue what the dot in that formula really means.
. is shorthand for 'everything not otherwise specified in the model
formula'. In this case, it represents the entire set of cycle
variables.
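One way to convince yourself of what the dot stands for is to spell the left hand side out by hand with cbind() and compare. This is a sketch on a hypothetical three-cycle version of the sample data; df, chr, f, with_dot and with_cbind are illustrative names.

```r
# Hypothetical stand-in for the poster's data: first three cycles only.
df <- data.frame(
  id     = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  cycle1 = c("c", "m", "f", "m", "f", "c", "a", "c", "f", NA),
  cycle2 = c("c", "m", "f", "m", "f", "c", "a", "c", "f", NA),
  cycle3 = c("c", "m", "f", "m", "f", "c", NA,  "c", "f", "m"),
  stringsAsFactors = FALSE
)
chr <- lapply(df, as.character)
f   <- function(x) paste(sort(x), collapse = "")

# Dot notation: everything except id goes on the left hand side.
with_dot <- aggregate(. ~ id, data = chr, FUN = f, na.action = na.pass)

# The same request with the cycle variables named explicitly.
with_cbind <- aggregate(cbind(cycle1, cycle2, cycle3) ~ id, data = chr,
                        FUN = f, na.action = na.pass)

names(with_dot)
all(as.matrix(with_dot) == as.matrix(with_cbind))
```

With many cycle columns the dot simply saves typing all of their names.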
2. help says:
Note that ‘paste()’ coerces ‘NA_character_’, the character missing
value, to ‘"NA"’
And at the same time:
‘na.pass’ returns the object unchanged.
I am happy that I don't have NAs in mydata. I just don't understand
how it happened.
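The two help quotes fit together as follows: na.pass only tells aggregate() to keep rows that contain NA, whereas the formula method's default, na.omit, would drop them before FUN is ever called. The NAs then disappear inside FUN, because sort() drops them. A sketch on a hypothetical three-cycle version of the sample data (df, chr, f, keep_na and drop_na are illustrative names):

```r
# Hypothetical stand-in for the poster's data: first three cycles only.
df <- data.frame(
  id     = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  cycle1 = c("c", "m", "f", "m", "f", "c", "a", "c", "f", NA),
  cycle2 = c("c", "m", "f", "m", "f", "c", "a", "c", "f", NA),
  cycle3 = c("c", "m", "f", "m", "f", "c", NA,  "c", "f", "m"),
  stringsAsFactors = FALSE
)
chr <- lapply(df, as.character)
f   <- function(x) paste(sort(x), collapse = "")

# na.pass: rows with NA are kept; sort() inside f() drops the NA values.
keep_na <- aggregate(. ~ id, data = chr, FUN = f, na.action = na.pass)

# Default na.action (na.omit): any row containing an NA is removed first.
drop_na <- aggregate(. ~ id, data = chr, FUN = f)

keep_na[keep_na$id == "3", ]  # acf acf cfm
drop_na[drop_na$id == "3", ]  # cf  cf  cf
```

For id 3, the default loses the "a" and the "m" because they sit in rows that also contain an NA.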
3. Can't see the real difference between 'FUN = function(x) paste(x)'
and 'FUN = paste'. However, the former works perfectly while the
latter simply does not.
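On this point, a literal function(x) paste(x) and a bare paste do behave the same for a single argument. Assuming 'function(x) paste(x)' in the question is shorthand for Henrique's full anonymous function, the difference lies in sort() and collapse, as this sketch on a made-up vector x shows:

```r
x <- c("m", "f", "c")

# Bare paste(): one string per element, nothing is combined or sorted.
paste(x)          # "m" "f" "c"
length(paste(x))  # 3

# Henrique's function: sorted and collapsed into one string per group.
(function(x) paste(sort(x), collapse = ""))(x)  # "cfm"
```

aggregate() can only put one tidy value in each cell when FUN returns a single value per group, which the collapsed version does and bare paste does not.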
All I can follow from the code above is that R breaks the data into
groups with the same id, then tears each little 'cycle' piece into
separate characters, then sorts them and puts these characters
together within the same id for each 'cycle'. I miss how R puts all
this mess back together into a nice data frame of long format. NAs
are also a question, as I said before.
By default, aggregate() will try to return a data frame. For each id,
it will output the id and the result of the function applied to each
cycle variable, so there should be one row for each id, and n + 1
columns for the n cycle variables + id.
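The shape of the output can be checked directly. This sketch runs Henrique's call on a hypothetical three-cycle version of the sample data, so n = 3 here; res is an illustrative name.

```r
# Hypothetical stand-in for the poster's data: first three cycles only.
df <- data.frame(
  id     = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  cycle1 = c("c", "m", "f", "m", "f", "c", "a", "c", "f", NA),
  cycle2 = c("c", "m", "f", "m", "f", "c", "a", "c", "f", NA),
  cycle3 = c("c", "m", "f", "m", "f", "c", NA,  "c", "f", "m"),
  stringsAsFactors = FALSE
)

res <- aggregate(. ~ id, data = lapply(df, as.character),
                 FUN = function(x) paste(sort(x), collapse = ""),
                 na.action = na.pass)

res         # one row per id
class(res)  # "data.frame"
dim(res)    # 3 rows, 4 columns (id + 3 cycle variables)
```

Each cell holds the single string that FUN returned for that id and cycle variable.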
Does that help?
Cheers,
Dennis
Could you please shed some light on it, if you don't mind answering
these naive questions.
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.