Re: [math] MATH-224 - need a better idea

Phil Steitz Tue, 21 Apr 2009 03:20:44 -0700

John Bollinger wrote:

The same approach could certainly be applied for DescriptiveStatistics, but the 
variable window complicates things: if a finite window is selected for the 
aggregate statistics then they will be sensitive to the order in which values 
are added to the contributing per-partition statistics.  That problem exists no 
matter when the aggregation is performed, however, and I guess the order we 
would get is reasonably likely to be the desired one.  Also, the 
removeMostRecentValue() and replaceMostRecentValue() methods are a bit tricky 
if they need to cascade to the aggregate statistics because the most recent 
value for one contributor may not be the most recent value for the aggregate.  
Anyway, I'll prepare an AggregateDescriptiveStatistics along the same line as 
my AggregateSummaryStatistics, and then at least we'll have something concrete 
to discuss.  Shall I post it as an additional patch for MATH-224?

DescriptiveStatistics does provide an opportunity for aggregating after the 
fact that SummaryStatistics doesn't, because each contributing statistic 
remembers (some of) the values provided to it. On the other hand, users already 
can manually aggregate DescriptiveStatistics objects. What they cannot easily 
do after the fact is duplicate the overall order in which values were added to 
the set of DescriptiveStatistics, and that is exactly what 
AggregateDescriptiveStatistics will provide. I think I'm rambling now, so I'll 
stop and write some code.

Always a good idea ^

I was thinking initially of post-hoc aggregation, using the backing data, but it is worth investigating the approach above. Thanks!


Phil


Regards,

John




________________________________
From: Phil Steitz <[email protected]>
To: Commons Developers List <[email protected]>
Sent: Monday, April 20, 2009 7:01:20 AM
Subject: Re: [math] MATH-224 - need a better idea

Ted Dunning wrote:

That is a fine answer for some things, but the parallel cases fail.

My feeling is that there are a few cases where there are nice aggregatable
summary statistics like moments and there are many cases where this just
doesn't work well (such as rank statistics).

Yes, this is why not all statistics are "storeless."  We have another "summary" class that maintains its data 
in storage and supports "rolling" behavior in DescriptiveStatistics.  The discussion here is focussed on the 
"storeless" case, which is limited to those stats that are computable in this way.  The cases of interest are stats 
that can be computed in one pass through the data but which can't be "aggregated" post hoc.  John's approach provides a 
simple solution to this problem.

For completeness, we should probably similarly implement aggregation in the sense defined in MATH-224 for DescriptiveStatistics as well.

Phil

For the latter case, I usually
make do with a surrogate such as a random sub-sample or a recency weighted
random sub-sample combined with a few aggregatable stats such as total
samples, max, min, sum and second moment.  That gives me most of what I want
and if the sub-sample is reasonably large, I can sometimes estimate a few
parameters such as total uniques.  The sub-sampled data streams can be
combined trivially so I now have an aggregatable approximation of
non-aggregatable statistics.  For descriptive quantiles this is generally
just fine.
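
[Illustration, not part of the original message.] The "aggregatable stats"
listed above (total samples, max, min, sum and second moment) can be merged
after the fact because each is a simple combination of the per-partition
values. A minimal sketch, assuming a hypothetical MergeableMoments class that
is not part of Commons Math; the sub-sampling part of the approach is not
shown, and the raw sum-of-squares variance is numerically fragile for large or
nearly constant data:

```java
/** Hypothetical sketch of post-hoc mergeable summary values; not a Commons Math class. */
public class MergeableMoments {
    long n;                                // total samples
    double min = Double.POSITIVE_INFINITY;
    double max = Double.NEGATIVE_INFINITY;
    double sum;                            // first raw moment times n
    double sumSq;                          // second raw moment times n

    /** Adds one observation to this partition. */
    void add(double x) {
        n++;
        min = Math.min(min, x);
        max = Math.max(max, x);
        sum += x;
        sumSq += x * x;
    }

    /** Folds another partition into this one; the order of merging does not matter. */
    void merge(MergeableMoments other) {
        n += other.n;
        min = Math.min(min, other.min);
        max = Math.max(max, other.max);
        sum += other.sum;
        sumSq += other.sumSq;
    }

    double mean() {
        return n == 0 ? Double.NaN : sum / n;
    }

    /** Population variance from raw moments (simple, but not numerically robust). */
    double populationVariance() {
        if (n == 0) {
            return Double.NaN;
        }
        double m = sum / n;
        return sumSq / n - m * m;
    }

    public static void main(String[] args) {
        MergeableMoments a = new MergeableMoments();
        MergeableMoments b = new MergeableMoments();
        a.add(1.0);
        a.add(2.0);
        b.add(3.0);
        b.add(4.0);
        a.merge(b); // combine the two partitions after the fact
        System.out.println(a.n + " " + a.min + " " + a.max + " " + a.mean());
    }
}
```

Rank statistics such as quantiles have no comparable merge rule, which is why
the message falls back on a sub-sample for those.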

On Sun, Apr 19, 2009 at 2:44 PM, John Bollinger <[email protected]> wrote:

The key would be to generate the aggregate statistics at the same time as
the per-partition ones, instead of aggregating them after the fact.
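
[Illustration, not part of the original message.] One way to read the idea
above: each per-partition statistic forwards every value it receives to a
shared aggregate, so the aggregate is built in the same single pass. The
sketch below is only an assumption about the shape of such a design; it is not
the patch attached to MATH-224, it does not use the Commons Math classes, and
the names RunningStats and ContributingStats are hypothetical.

```java
/** Hypothetical sketch of aggregating at update time; not the MATH-224 patch. */
public class AggregateSketch {

    /** Storeless running statistics (count, mean, variance) using Welford's update. */
    static class RunningStats {
        private long n;
        private double mean;
        private double m2; // sum of squared deviations from the current mean

        void addValue(double x) {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
        }

        long getN() {
            return n;
        }

        double getMean() {
            return n == 0 ? Double.NaN : mean;
        }

        /** Sample variance; undefined for fewer than two values. */
        double getVariance() {
            return n < 2 ? Double.NaN : m2 / (n - 1);
        }
    }

    /** Per-partition statistic that also feeds a shared aggregate. */
    static class ContributingStats extends RunningStats {
        private final RunningStats aggregate;

        ContributingStats(RunningStats aggregate) {
            this.aggregate = aggregate;
        }

        @Override
        void addValue(double x) {
            super.addValue(x);
            // Not thread-safe: partitions updated in parallel would need to synchronize here.
            aggregate.addValue(x);
        }
    }

    public static void main(String[] args) {
        RunningStats aggregate = new RunningStats();
        ContributingStats a = new ContributingStats(aggregate);
        ContributingStats b = new ContributingStats(aggregate);
        a.addValue(1.0);
        a.addValue(2.0);
        b.addValue(3.0);
        b.addValue(4.0);
        System.out.println(aggregate.getN() + " " + aggregate.getMean());
    }
}
```

Because the aggregate sees every value as it arrives, nothing has to be
reconstructed from the per-partition results afterwards.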



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]




