Re: [math] MATH-224 - need a better idea

Phil Steitz Mon, 20 Apr 2009 03:51:00 -0700

John Bollinger wrote:

I'm looking at commons-math for the first time, but I don't think the feature 
can be implemented as requested in a manner that is suitably generic.  On the 
other hand, I think the same objective could be achieved a different way 
without changing the base API at all.  The key would be to generate the 
aggregate statistics at the same time as the per-partition ones, instead of 
aggregating them after the fact.  That does require knowing beforehand that 
you're going to want the aggregate stats, but I think that's a fair tradeoff.  
This could be done without making client programs update two sets of statistics 
with each datum, by wrapping each StorelessUnivariateStatistic with an 
implementation that forwards the data to two StorelessUnivariateStatistics -- 
the wrapped one and one for the aggregate.  Almost all the work of setting that 
up can be automated.
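
A minimal sketch of the forwarding idea described above, not the eventual proof of concept. StorelessStat is a hypothetical stand-in for the increment/getResult part of StorelessUnivariateStatistic, and RunningSum and ForwardingStat are names invented for illustration:

```java
// Hypothetical stand-in for part of StorelessUnivariateStatistic;
// this is NOT the real commons-math interface.
interface StorelessStat {
    void increment(double d);
    double getResult();
}

// Trivial storeless statistic, used only so there is something to wrap.
class RunningSum implements StorelessStat {
    private double sum;
    public void increment(double d) { sum += d; }
    public double getResult() { return sum; }
}

// Forwards each datum to the wrapped per-partition statistic and to a
// shared aggregate statistic. Not thread-safe: the aggregate is shared.
class ForwardingStat implements StorelessStat {
    private final StorelessStat wrapped;
    private final StorelessStat aggregate;

    ForwardingStat(StorelessStat wrapped, StorelessStat aggregate) {
        this.wrapped = wrapped;
        this.aggregate = aggregate;
    }

    public void increment(double d) {
        wrapped.increment(d);    // per-partition statistic
        aggregate.increment(d);  // aggregate sees the same datum
    }

    // The wrapper reports the per-partition result; the aggregate is
    // read directly from the shared instance.
    public double getResult() { return wrapped.getResult(); }
}
```

Client code would update only the wrappers; the shared aggregate instance then reflects every datum passed to any of them.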


I'll see whether I can whip up a proof of concept for you to check out.

I like this approach. As you point out, it avoids entirely the issues raised above and is actually quite flexible in terms of when streams start and end, etc. The only downsides are a) cost of all the "forwarded" increment calls (not likely to be a real practical issue in most cases) and b) ease of use. I mention b) only because I had to think for 5 seconds before anticipating how the test case was going to be coded. I would appreciate feedback from others on this - especially those requesting the feature.


Thanks!

Phil

John

________________________________
From: Phil Steitz <phil.ste...@gmail.com>
To: Commons Developers List <dev@commons.apache.org>
Sent: Sunday, April 19, 2009 11:34:24 AM
Subject: [math] MATH-224 - need a better idea

We should be able to find a clean way to do what this enhancement request is 
asking for.  I am feeling stupid because even when I consider breaking 
compatibility / refactoring to use generics, I can't find a simple way to do 
it.  Here is a description of the current API and some failed ideas that I have 
considered so far.   As usual, I would like to minimize pain for current users 
in addressing this, but at this point I am starting to think that wholesale 
refactoring is necessary and I would appreciate ideas on the best way to do 
this.

SummaryStatistics provides "storeless" computation of summary statistics - min, max, mean, variance, etc.  
Here "storeless" means that the class does not hold the stream of data in memory.  It was designed to support 
pluggable implementations of the statistics that it computes.  It does this in a way that looks smelly in the new world 
of type-safe Java (well, maybe it always smelled ;)  The injectable implementation classes in SummaryStatistics are 
typed as "StorelessUnivariateStatistic" which is an interface that includes things like getResult() and 
increment(double).  There is nothing preventing, for example, a variance implementation from being "plugged 
in" to implement the mean.

The request in MATH-224 is to support aggregation in the following sense:  SummaryStatistics 
instance 1 gets a stream of values and instance 2 gets another stream of values and we want to 
create a new instance or replace instance 1 with an instance that behaves as though it got all the 
data from both streams.  The simplest way to do this would be to add an "aggregate" 
method to the StorelessUnivariateStatistic interface and then just implement aggregation in 
SummaryStatistics by delegation to the implementation instances.  This is essentially what the 
patch attached to MATH-224 does.  The problem with this approach is that supporting aggregation is 
a fairly strong requirement in general, stronger than just requiring that the statistic be 
computable without storing the data.  Stronger still is the requirement that an implementation of a 
statistic be "aggregatable" with a possibly different implementation (since then it would 
have access only to the value of the other statistic).
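
To illustrate why aggregation is a stronger requirement than being storeless, here is a rough sketch of merging mean and variance from two streams. It uses the standard pairwise update for the count, mean, and sum of squared deviations. The Moments class is hypothetical and is not how commons-math implements these statistics; the point is that the merge needs the other instance's internal state (n and m2), not just its reported result:

```java
// Hypothetical illustration class; not part of commons-math.
class Moments {
    long n;       // number of values seen
    double mean;  // running mean
    double m2;    // running sum of squared deviations from the mean

    // Storeless single-value update.
    void increment(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    // Sample variance; NaN with fewer than two values (a choice made
    // for this sketch only).
    double variance() {
        return n > 1 ? m2 / (n - 1) : Double.NaN;
    }

    // Merge another instance so that this one behaves as though it had
    // received both streams. Requires access to other's n, mean and m2.
    void aggregate(Moments other) {
        if (other.n == 0) {
            return;
        }
        if (n == 0) {
            n = other.n;
            mean = other.mean;
            m2 = other.m2;
            return;
        }
        long total = n + other.n;
        double delta = other.mean - mean;
        mean += delta * other.n / total;
        m2 += other.m2 + delta * delta * ((double) n * other.n / total);
        n = total;
    }
}
```

An implementation that exposed only getResult() could not supply the m2 term, which is the difficulty with aggregating across different implementations.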

So the challenge is: can we find a clean way to achieve the following four objectives?

0) Maintain pluggability of statistics implementations
1) Support aggregation
2) Improve type safety
3) Minimize trauma for current users

Dropping 0) makes things much simpler, but I would like to avoid that unless 
there is really no way to accomplish 1) and 2) without taking that step.  
Strictly speaking, 1) may be impossible as I know of no way to support this for 
the higher moments.  I would be OK with aggregation forcing these to NaN 
(documented, of course).

My first thought was to define a parameterized Aggregatable interface that 
requires the same types.  Then two SummaryStatistics instances are aggregatable 
iff their implementation statistics match types.  I am OK with these 
restrictions, but am having trouble actually making it work.
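
For concreteness, a sketch of what a self-typed Aggregatable interface could look like at the level of a single statistic. All names here (Aggregatable, SimpleMean, SimpleMax) are hypothetical. It does not address the harder part, namely how SummaryStatistics, whose pluggable fields are typed only as StorelessUnivariateStatistic, would check that two instances have matching implementations:

```java
// Hypothetical interface: a statistic can only be aggregated with
// another statistic of the same type T.
interface Aggregatable<T extends Aggregatable<T>> {
    void aggregate(T other);
}

// Illustrative mean that keeps enough state (n and sum) to be merged.
class SimpleMean implements Aggregatable<SimpleMean> {
    private long n;
    private double sum;

    void increment(double d) { n++; sum += d; }

    double getResult() { return n == 0 ? Double.NaN : sum / n; }

    public void aggregate(SimpleMean other) {
        n += other.n;
        sum += other.sum;
    }
}

// Illustrative max; merging is just the larger of the two results.
class SimpleMax implements Aggregatable<SimpleMax> {
    private double max = Double.NaN;

    void increment(double d) {
        if (Double.isNaN(max) || d > max) { max = d; }
    }

    double getResult() { return max; }

    public void aggregate(SimpleMax other) {
        if (!Double.isNaN(other.max)) { increment(other.max); }
    }
}

// With these declarations, new SimpleMean().aggregate(new SimpleMax())
// is rejected at compile time, since SimpleMax is not a SimpleMean.
```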

Suggestions / patches welcome!

Phil

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org



