That is a fine answer for some things, but the parallel cases fail. My feeling is that there are a few cases with nice aggregatable summary statistics like moments, and many cases where this just doesn't work well (such as rank statistics). For the latter case, I usually make do with a surrogate such as a random sub-sample or a recency-weighted random sub-sample, combined with a few aggregatable stats such as total samples, max, min, sum and second moment. That gives me most of what I want, and if the sub-sample is reasonably large, I can sometimes estimate a few parameters such as total uniques. The sub-sampled data streams can be combined trivially, so I now have an aggregatable approximation of non-aggregatable statistics. For descriptive quantiles this is generally just fine.
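A minimal sketch of the idea, in Python. All names here are my own illustration, not anyone's production code: each partition keeps a reservoir sample plus exact aggregatable stats (count, min, max, sum, sum of squares); merging is exact for the stats and approximate for the sample, which is drawn from the two reservoirs in proportion to their stream sizes. Quantiles then come straight off the merged sub-sample.

```python
import random

class SampledStats:
    """Aggregatable surrogate: reservoir sub-sample + exact summary stats."""

    def __init__(self, capacity=1000, seed=None):
        self.capacity = capacity
        self.rng = random.Random(seed)
        self.n = 0                    # total samples seen
        self.total = 0.0              # sum
        self.sumsq = 0.0              # second moment (un-normalized)
        self.min = float('inf')
        self.max = float('-inf')
        self.sample = []              # reservoir of at most `capacity` values

    def add(self, x):
        self.n += 1
        self.total += x
        self.sumsq += x * x
        self.min = min(self.min, x)
        self.max = max(self.max, x)
        # Standard reservoir sampling: keep each element with prob capacity/n.
        if len(self.sample) < self.capacity:
            self.sample.append(x)
        else:
            j = self.rng.randrange(self.n)
            if j < self.capacity:
                self.sample[j] = x

    def merge(self, other):
        """Combine two partitions' sketches: exact stats, approximate sample."""
        out = SampledStats(self.capacity)
        out.n = self.n + other.n
        out.total = self.total + other.total
        out.sumsq = self.sumsq + other.sumsq
        out.min = min(self.min, other.min)
        out.max = max(self.max, other.max)
        # Fill the merged reservoir by drawing from each side in proportion
        # to the size of the stream it summarizes.
        pool_a, pool_b = list(self.sample), list(other.sample)
        k = min(out.capacity, len(pool_a) + len(pool_b))
        for _ in range(k):
            pick_a = bool(pool_a) and self.rng.random() < self.n / out.n
            src = pool_a if (pick_a or not pool_b) else pool_b
            out.sample.append(src.pop(self.rng.randrange(len(src))))
        return out

    def quantile(self, q):
        """Descriptive quantile estimated from the sub-sample."""
        s = sorted(self.sample)
        return s[min(len(s) - 1, int(q * len(s)))]
```

Because merge only reads counts and reservoirs, partitions can be combined in any order, which is exactly the aggregation property that exact rank statistics lack.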
On Sun, Apr 19, 2009 at 2:44 PM, John Bollinger <thinma...@yahoo.com> wrote:
> The key would be to generate the aggregate statistics at the same time as
> the per-partition ones, instead of aggregating them after the fact.

--
Ted Dunning, CTO DeepDyve