On Feb 6, 2011, at 7:41 PM, Hadley Wickham wrote:
There's definitely something amiss with aggregate() here since similar
functions from other packages can reproduce your 'control' sum. I expect
ddply() will have some timing issues because of all the subgrouping in
your data frame, but data.table did very well and the summaryBy()
function in the doBy package did OK:
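The benchmark itself isn't quoted above; as a sketch only, reusing the dat,
x1..x8 and y names from the thread (everything else here is assumed), that
kind of grouped sum might look like:

# base R aggregate(), the call reported to give a different sum
aggregate(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8, data = dat, FUN = sum)

# doBy::summaryBy(), one of the alternatives mentioned
library(doBy)
summaryBy(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8, data = dat, FUN = sum)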
Well, if you use the right plyr function, it works just fine:
system.time(count(dat, c("x1", "x2", "x3", "x4", "x5", "x6",
"x7", "x8"), "y"))
# user system elapsed
# 9.754 1.314 11.073
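For contrast, a sketch (not from the thread) of the slower plyr route over
the same grouping, using ddply() with summarise:

library(plyr)
system.time(ddply(dat, c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"),
                  summarise, total = sum(y)))
# each subgroup is materialised as its own data frame, which is where
# the extra time goes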
Which illustrates something that I've believed for a while about
data.table - it's not the indexing that speeds things up, it's the
custom data structure. If you use ddply with data frames, it's slow
because data frames are slow. I think the right way to resolve this
is to make data frames more efficient, perhaps using some kind of
mutable interface where necessary for high-performance operations.
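One way to probe that claim (a sketch, assuming the same dat; the timings
are left to the reader, nothing here is from the thread) is to run the
grouped sum on a data.table with and without a key:

library(data.table)
dt <- as.data.table(dat)
# no key set, so no index is involved in the grouping
system.time(dt[, sum(y), by = "x1,x2,x3,x4,x5,x6,x7,x8"])
# key the table and repeat; if indexing were the whole story,
# only this second call would be fast
setkey(dt, x1, x2, x3, x4, x5, x6, x7, x8)
system.time(dt[, sum(y), by = "x1,x2,x3,x4,x5,x6,x7,x8"])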
Data.frames are also "fat". Simply adding a single new column to a
dataset bordering on "large" (5 million rows by 200 columns) requires
more than twice the memory of the full data frame. (Paging ensues on a
Mac with 24GB.) Unless, of course, there is a more memory-efficient
strategy than:
dfrm$newcol <- with(dfrm, func(variables))
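One more memory-frugal strategy (a sketch; func(variables) is the same
placeholder as above) is assignment by reference with data.table's :=
operator, which adds the column in place rather than copying the object:

library(data.table)
dt <- as.data.table(dfrm)        # one-time conversion; this step does copy
dt[, newcol := func(variables)]  # column added by reference, no further copy of dt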
The table() operation, on the other hand, is blazingly fast and requires
practically no memory.
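For completeness, a sketch of the kind of call meant here (the two column
names are assumed):

# plain cross-tabulation straight from the vectors, no per-group data frames
with(dat, table(x1, x2))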
--
David Winsemius, MD
West Hartford, CT