On Feb 6, 2011, at 7:41 PM, Hadley Wickham wrote:

There's definitely something amiss with aggregate() here since similar functions from other packages can reproduce your 'control' sum. I expect ddply() will have some timing issues because of all the subgrouping in your data frame, but data.table did very well and the summaryBy() function in the
doBy package did OK:
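
(For reference, a minimal sketch of the two ends of that comparison, assuming the data frame `dat` with grouping columns x1..x8 and a numeric column y from earlier in the thread:)

## grouped sum via aggregate() -- the slow case under discussion
aggregate(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8, data = dat, FUN = sum)

## the same summary via data.table, reported above to be much faster
library(data.table)
as.data.table(dat)[, list(y = sum(y)), by = "x1,x2,x3,x4,x5,x6,x7,x8"]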

Well, if you use the right plyr function, it works just fine:

library(plyr)
## with "y" as the wt_var, count() sums y within each group of x1..x8
system.time(count(dat, c("x1", "x2", "x3", "x4", "x5", "x6",
                         "x7", "x8"), "y"))
#   user  system elapsed
#  9.754   1.314  11.073

Which illustrates something that I've believed for a while about data.table - it's not the indexing that speeds things up, it's the custom data structure. If you use ddply with data frames, it's slow because data frames are slow. I think the right way to resolve this is to make data frames more efficient, perhaps using some kind of mutable interface where necessary for high-performance operations.
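
(A rough way to illustrate that point, sketched here rather than benchmarked: run the same grouped summary on an unkeyed data.table, so no index is involved, and compare the timing to the count() call above.)

library(data.table)
dt <- as.data.table(dat)   # same data, data.table's structure, no key set
system.time(
  dt[, list(freq = .N, y = sum(y)),
     by = "x1,x2,x3,x4,x5,x6,x7,x8"]
)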

Data.frames are also "fat". Simply adding a single new column to a dataset bordering on "large" (5 million rows by 200 columns) requires more than twice the memory of the full data frame. (Paging ensues on a Mac with 24 GB.) Unless, of course, there is a more memory-efficient strategy than:

 dfrm$newcol <- with(dfrm, func(variables))
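
(One candidate, sketched rather than benchmarked: data.table adds a column by reference with `:=`, so the rest of the table is not copied. `func(variables)` is the same placeholder as above.)

library(data.table)
DT <- as.data.table(dfrm)          # one-time conversion
DT[, newcol := func(variables)]    # column added in place, no copy of DT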

The table() operation, on the other hand, is blazingly fast and requires practically no memory.
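
(For example, the counting step alone can be done along these lines; note that with many grouping columns the resulting array grows as the product of the numbers of levels:)

## cross-tabulated counts over a few of the grouping variables
tab <- table(dat[c("x1", "x2", "x3")])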

--

David Winsemius, MD
West Hartford, CT

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
