[R] A manipulation problem for a large data set in R

Giuseppe Paleologo Wed, 27 Aug 2008 08:46:54 -0700

I have two questions for the group. One is very concrete, and is dangerously
close to a "please do my homework" posting. The second follows from the
first one but is more general. I would welcome the advice of experienced R
users.


As for the first one: I have a data frame with two variables

X  Y
A,   chris
D,   chris
B,   chris
B,   chris
C,   andrew
E,   andrew
C,   andrew
B,   beth
D,   chris
D,   beth
C,   beth
D,   beth
D,   beth
A,   andrew
A,   andrew
A,   andrew
C,   chris
B,   beth
D,   chris
E,   andrew
D,   chris
D,   beth
D,   chris
A,   andrew
A,   chris
C,   chris
A,   chris
B,   chris
C,   beth
A,   chris

I would like to produce a table with one row for every level of the factor
X, and multiple columns filled with the levels of the factor Y that are
observed jointly with that level of X. Hence:

X   Z1  Z2  Z3
A,  andrew,  chris
B,  chris,  beth
C,  andrew,  beth,  chris
D,  chris,  beth
E,  andrew

A solution would be to do something like

temp = tapply(Y, X, function(a) levels(a[, drop=TRUE]))

and then to put the output in an appropriately sized data frame. The issue
I have with this is that it is inelegant and rather slow for my typical data
set (~200K rows). So I was wondering whether a nicer, more efficient solution
exists.
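
For illustration, here is a minimal sketch of one possible alternative using
only base R. It is not a tested or benchmarked answer for the 200K-row case.
The toy data frame df, the sorting of names within each row, and the Z1, Z2,
... column names are assumptions made for the example. The idea is to drop
duplicate (X, Y) pairs once, split Y by X, and pad each group with NA to a
common width.

```r
# Toy data with the same two columns as the example above (values are illustrative).
df <- data.frame(
  X = c("A", "D", "B", "B", "C", "E", "C", "B", "D", "A"),
  Y = c("chris", "chris", "chris", "chris", "andrew",
        "andrew", "andrew", "beth", "beth", "andrew"),
  stringsAsFactors = TRUE
)

# Drop duplicate (X, Y) pairs so each pair is handled only once.
pairs <- unique(df)

# One character vector of observed Y values per level of X, sorted by name.
by_x <- lapply(split(as.character(pairs$Y), pairs$X), sort)

# Pad every vector with NA up to the longest one, then bind the rows together.
width <- max(lengths(by_x))
mat <- do.call(rbind, lapply(by_x, function(v) {
  c(v, rep(NA_character_, width - length(v)))
}))
colnames(mat) <- paste0("Z", seq_len(width))

# Final table: one row per level of X, one column per observed Y value.
result <- data.frame(X = rownames(mat), mat,
                     row.names = NULL, stringsAsFactors = FALSE)
print(result)
```

Rows with fewer observed Y values than the widest row are filled with NA on
the right. Whether this is actually faster than the tapply version would need
to be timed on the real data.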

This leads me to a second question. Maybe out of laziness, maybe because R
is good enough, I tend to do all my local data manipulations in R. This
includes de-duping records, joining tables, and grouping observations. I do
this also for larger data sets (say, dense tables with 100M+ elements). Is
this current practice among R users? If so, is there a tutorial, or an R
view on it?  If not, what do you use?
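
To make the three operations concrete, here is a small sketch of what
de-duping, joining, and grouping can look like in base R. The tables orders
and customers and all of their columns are hypothetical and exist only for
this example. It says nothing about how well these calls scale to tables with
100M+ elements.

```r
# Two hypothetical tables; row 3 of orders duplicates row 2.
orders <- data.frame(
  id       = c(1, 2, 2, 3),
  customer = c("andrew", "beth", "beth", "chris"),
  amount   = c(10, 20, 20, 5)
)
customers <- data.frame(
  customer = c("andrew", "beth", "chris"),
  region   = c("east", "west", "east")
)

# De-duping: keep one copy of each fully identical row.
deduped <- unique(orders)

# Joining: inner join of the two tables on the shared key column.
joined <- merge(deduped, customers, by = "customer")

# Grouping: total amount per region.
totals <- aggregate(amount ~ region, data = joined, FUN = sum)
print(totals)
```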

Thanks in advance,

-gappy


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
