On Feb 6, 2011, at 7:41 PM, Hadley Wickham wrote:
There's definitely something amiss with aggregate() here since similar
functions from other packages can reproduce your 'control' sum. I expect
ddply() will have some timing issues because of all the subgrouping in
your data frame, but data.table did very well and the summaryBy()
function in the doBy package did OK:
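The benchmark itself isn't quoted above; as a sketch only, reusing the dat,
x1..x8 and y names from the thread (everything else here is assumed), that
kind of grouped sum might look like:

# base R aggregate(), the call reported to give a different sum
aggregate(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8, data = dat, FUN = sum)

# doBy::summaryBy(), one of the alternatives mentioned
library(doBy)
summaryBy(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8, data = dat, FUN = sum)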
Well, if you use the right plyr function, it works just fine:
system.time(count(dat, c("x1", "x2", "x3", "x4", "x5", "x6",
"x7", "x8"), "y"))
# user system elapsed
# 9.754 1.314 11.073
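For contrast, a sketch (not from the thread) of the slower plyr route over
the same grouping, using ddply() with summarise:

library(plyr)
system.time(ddply(dat, c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"),
                  summarise, total = sum(y)))
# each subgroup is materialised as its own data frame, which is where
# the extra time goes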
Which illustrates something that I've believed for a while about
data.table - it's not the indexing that speeds things up, it's the
custom data structure. If you use ddply with data frames, it's slow
because data frames are slow. I think the right way to resolve this
is to make data frames more efficient, perhaps using some kind of
mutable interface where necessary for high-performance operations.
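One way to probe that claim (a sketch, assuming the same dat; the timings
are left to the reader, nothing here is from the thread) is to run the
grouped sum on a data.table with and without a key:

library(data.table)
dt <- as.data.table(dat)
# no key set, so no index is involved in the grouping
system.time(dt[, sum(y), by = "x1,x2,x3,x4,x5,x6,x7,x8"])
# key the table and repeat; if indexing were the whole story,
# only this second call would be fast
setkey(dt, x1, x2, x3, x4, x5, x6, x7, x8)
system.time(dt[, sum(y), by = "x1,x2,x3,x4,x5,x6,x7,x8"])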
Data.frames are also "fat". Simply adding a single new column to a
dataset bordering on "large" (5 million rows by 200 columns) requires
more than twice the memory of the full data frame. (Paging ensues on a
Mac with 24GB.) Unless, of course, there is a more memory-efficient
strategy than:
dfrm$newcol <- with(dfrm, func(variables))
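One more memory-frugal strategy (a sketch; func(variables) is the same
placeholder as above) is assignment by reference with data.table's :=
operator, which adds the column in place rather than copying the object:

library(data.table)
dt <- as.data.table(dfrm)        # one-time conversion; this step does copy
dt[, newcol := func(variables)]  # column added by reference, no further copy of dt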
The table() operation, on the other hand, is blazingly fast and requires
practically no memory.
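For completeness, a sketch of the kind of call meant here (the two column
names are assumed):

# plain cross-tabulation straight from the vectors, no per-group data frames
with(dat, table(x1, x2))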
--
David Winsemius, MD
West Hartford, CT