That is a fine answer for some things, but the parallel cases fail.

My feeling is that there are a few cases with nice aggregatable summary
statistics, such as moments, and many cases where this just doesn't work
well, such as rank statistics.  For the latter case, I usually make do with a
surrogate such as a random sub-sample or a recency-weighted random sub-sample,
combined with a few aggregatable stats such as total samples, max, min, sum
and second moment.  That gives me most of what I want, and if the sub-sample
is reasonably large, I can sometimes estimate a few parameters such as total
uniques.  The sub-sampled data streams can be combined trivially, so I now
have an aggregatable approximation of non-aggregatable statistics.  For
descriptive quantiles this is generally just fine.
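
A minimal Python sketch of the idea, for illustration only.  The class name
Summary, its fields, and the choice of "bottom-k" sampling (tag every value
with a uniform random number and keep the k smallest tags) are assumptions
made here, not something specified above; bottom-k is just one way to get a
uniform sub-sample that merges by taking the union and re-trimming.  Recency
weighting is not shown.

```python
import heapq
import random
from dataclasses import dataclass, field


@dataclass
class Summary:
    """Hypothetical mergeable summary: exact moments plus a bottom-k sample."""
    k: int = 1000                      # target sub-sample size (assumed default)
    n: int = 0                         # total samples seen
    total: float = 0.0                 # sum of values
    sum_sq: float = 0.0                # sum of squared values (second moment)
    lo: float = float("inf")
    hi: float = float("-inf")
    sample: list = field(default_factory=list)  # (random tag, value) pairs

    def add(self, x, rng=random):
        self.n += 1
        self.total += x
        self.sum_sq += x * x
        self.lo = min(self.lo, x)
        self.hi = max(self.hi, x)
        self.sample.append((rng.random(), x))
        # Trim lazily; only the k smallest tags can ever survive.
        if len(self.sample) > 2 * self.k:
            self.sample = heapq.nsmallest(self.k, self.sample)

    def merge(self, other):
        """Combine two summaries; the result summarizes both streams."""
        out = Summary(k=min(self.k, other.k))
        out.n = self.n + other.n
        out.total = self.total + other.total
        out.sum_sq = self.sum_sq + other.sum_sq
        out.lo = min(self.lo, other.lo)
        out.hi = max(self.hi, other.hi)
        # The k smallest tags of the union are a uniform sample of the union.
        out.sample = heapq.nsmallest(out.k, self.sample + other.sample)
        return out

    def mean(self):
        return self.total / self.n

    def variance(self):
        # Population variance from the aggregated moments.
        return self.sum_sq / self.n - self.mean() ** 2

    def quantile(self, q):
        """Approximate quantile, estimated from the sub-sample only."""
        values = sorted(v for _, v in heapq.nsmallest(self.k, self.sample))
        return values[min(len(values) - 1, int(q * len(values)))]
```

Each partition builds its own Summary with add(), and the per-partition
results are combined with merge().  Count, sum, min, max and second moment
come out exact; quantile() is only as good as the sub-sample, and is exact
only when the sub-sample happens to hold every value.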

On Sun, Apr 19, 2009 at 2:44 PM, John Bollinger <thinma...@yahoo.com> wrote:

> The key would be to generate the aggregate statistics at the same time as
> the per-partition ones, instead of aggregating them after the fact.




-- 
Ted Dunning, CTO
DeepDyve
