[math] MATH-224 - need a better idea

Phil Steitz Sun, 19 Apr 2009 08:34:56 -0700

We should be able to find a clean way to do what this enhancement request is asking for. I am feeling stupid because even when I consider breaking compatibility / refactoring to use generics, I can't find a simple way to do it. Here is a description of the current API and some failed ideas that I have considered so far. As usual, I would like to minimize pain for current users in addressing this, but at this point I am starting to think that wholesale refactoring is necessary and I would appreciate ideas on the best way to do this.

SummaryStatistics provides "storeless" computation of summary statistics: min, max, mean, variance, etc. Here "storeless" means that the class does not hold the stream of data in memory. It was designed to support pluggable implementations of the statistics that it computes. It does this in a way that looks smelly in the new world of type-safe Java (well, maybe it always smelled ;) The injectable implementation classes in SummaryStatistics are typed as "StorelessUnivariateStatistic" which is an interface that includes things like getResult() and increment(double). There is nothing preventing, for example, a variance implementation from being "plugged in" to implement the mean.
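
To make the type-safety hole concrete, here is a minimal sketch. All of the names in it (StorelessStat, RunningMean, RunningMax, LooseSummary) are simplified stand-ins invented for illustration, not the actual Commons Math classes; the only thing taken from the description above is that every injectable implementation shares one interface type.

```java
// Hypothetical stand-in for the shared interface; not the Commons Math source.
interface StorelessStat {
    void increment(double d);
    double getResult();
}

// Storeless mean: keeps only a count and the running mean.
final class RunningMean implements StorelessStat {
    private long n;
    private double mean;
    public void increment(double d) { n++; mean += (d - mean) / n; }
    public double getResult() { return n == 0 ? Double.NaN : mean; }
}

// Storeless max: keeps only the largest value seen so far.
final class RunningMax implements StorelessStat {
    private double max = Double.NaN;
    public void increment(double d) { if (Double.isNaN(max) || d > max) { max = d; } }
    public double getResult() { return max; }
}

// Hypothetical summary class with one pluggable slot.
final class LooseSummary {
    private StorelessStat meanImpl = new RunningMean();
    // Any StorelessStat is accepted: the compiler cannot tell a mean from a max.
    void setMeanImpl(StorelessStat impl) { meanImpl = impl; }
    void addValue(double d) { meanImpl.increment(d); }
    double getMean() { return meanImpl.getResult(); }
}

final class TypeHoleDemo {
    public static void main(String[] args) {
        LooseSummary s = new LooseSummary();
        s.setMeanImpl(new RunningMax()); // compiles without complaint
        s.addValue(1.0);
        s.addValue(2.0);
        s.addValue(6.0);
        System.out.println(s.getMean()); // prints 6.0, not the true mean 3.0
    }
}
```

The mistake is only visible at runtime, in the reported value.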

The request in MATH-224 is to support aggregation in the following sense: SummaryStatistics instance 1 gets a stream of values and instance 2 gets another stream of values and we want to create a new instance or replace instance 1 with an instance that behaves as though it got all the data from both streams. The simplest way to do this would be to add an "aggregate" method to the StorelessUnivariateStatistic interface and then just implement aggregation in SummaryStatistics by delegation to the implementation instances. This is essentially what the patch attached to MATH-224 does. The problem with this approach is that supporting aggregation is a fairly strong requirement in general, stronger than just requiring that the statistic be computable without storing the data. Stronger still is the requirement that an implementation of a statistic be "aggregatable" with a possibly different implementation (since then it would have access only to the value of the other statistic).
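
A rough sketch of the "aggregate on the interface" idea, to show where it works and where it strains. This is not the patch attached to MATH-224; CountedStat and MergeableMean are hypothetical names, and the sketch assumes the interface exposes a count alongside the result.

```java
// Hypothetical interface: result, count, and an aggregate operation.
interface CountedStat {
    void increment(double d);
    double getResult();
    long getN();
    // Make this statistic behave as if it had also seen other's data.
    void aggregate(CountedStat other);
}

final class MergeableMean implements CountedStat {
    private long n;
    private double mean;

    public void increment(double d) { n++; mean += (d - mean) / n; }
    public double getResult() { return n == 0 ? Double.NaN : mean; }
    public long getN() { return n; }

    // The mean is an easy case: the other side's result and count are enough,
    // so this works even if other is a different implementation of the mean.
    public void aggregate(CountedStat other) {
        long m = other.getN();
        if (m == 0) { return; }
        long total = n + m;
        mean += (other.getResult() - mean) * m / total;
        n = total;
    }
}

final class MergeDemo {
    public static void main(String[] args) {
        MergeableMean a = new MergeableMean();
        MergeableMean b = new MergeableMean();
        for (double d : new double[] {1.0, 2.0, 3.0}) { a.increment(d); }
        for (double d : new double[] {4.0, 5.0}) { b.increment(d); }
        a.aggregate(b);
        System.out.println(a.getN() + " values, mean " + a.getResult());
    }
}
```

A variance written against the same interface is harder: merging two variances needs the other side's mean as well, and that is not available from getResult() and getN() alone. That is the sense in which aggregating with a possibly different implementation is the stronger requirement.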


So the challenge is whether we can find a clean way to achieve these four objectives:

0) Maintain pluggability of statistics implementations
1) Support aggregation
2) Improve type safety
3) Minimize trauma for current users

Dropping 0) makes things much simpler, but I would like to avoid that unless there is really no way to accomplish 1) and 2) without taking that step. Strictly speaking, 1) may be impossible as I know of no way to support this for the higher moments. I would be OK with aggregation forcing these to NaN (documented, of course).

My first thought was to define a parameterized Aggregatable interface that requires the same types. Then two SummaryStatistics instances are aggregatable iff their implementation statistics match types. I am OK with these restrictions, but am having trouble actually making it work.
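
For discussion, one possible shape of the parameterized idea, reduced to two statistics. Everything here is a hypothetical sketch (Stat, Aggregatable, SimpleMean, SimpleMax and TypedSummary are invented names), and it is not offered as a solution to the difficulties described above.

```java
interface Stat {
    void increment(double d);
    double getResult();
    long getN();
}

// Self-typed: a T can only be aggregated with another T.
interface Aggregatable<T> {
    void aggregate(T other);
}

final class SimpleMean implements Stat, Aggregatable<SimpleMean> {
    private long n;
    private double mean;
    public void increment(double d) { n++; mean += (d - mean) / n; }
    public double getResult() { return n == 0 ? Double.NaN : mean; }
    public long getN() { return n; }
    // Same type on both sides, so internal state is directly accessible.
    public void aggregate(SimpleMean other) {
        if (other.n == 0) { return; }
        long total = n + other.n;
        mean += (other.mean - mean) * other.n / total;
        n = total;
    }
}

final class SimpleMax implements Stat, Aggregatable<SimpleMax> {
    private long n;
    private double max = Double.NaN;
    public void increment(double d) { if (n == 0 || d > max) { max = d; } n++; }
    public double getResult() { return max; }
    public long getN() { return n; }
    public void aggregate(SimpleMax other) {
        if (other.n == 0) { return; }
        if (n == 0 || other.max > max) { max = other.max; }
        n += other.n;
    }
}

// Two summaries are aggregatable only if both type parameters match.
final class TypedSummary<M extends Stat & Aggregatable<M>, X extends Stat & Aggregatable<X>>
        implements Aggregatable<TypedSummary<M, X>> {
    private final M meanImpl;
    private final X maxImpl;
    TypedSummary(M meanImpl, X maxImpl) { this.meanImpl = meanImpl; this.maxImpl = maxImpl; }
    void addValue(double d) { meanImpl.increment(d); maxImpl.increment(d); }
    double getMean() { return meanImpl.getResult(); }
    double getMax() { return maxImpl.getResult(); }
    public void aggregate(TypedSummary<M, X> other) {
        meanImpl.aggregate(other.meanImpl);
        maxImpl.aggregate(other.maxImpl);
    }
}

final class TypedDemo {
    public static void main(String[] args) {
        TypedSummary<SimpleMean, SimpleMax> a =
                new TypedSummary<SimpleMean, SimpleMax>(new SimpleMean(), new SimpleMax());
        TypedSummary<SimpleMean, SimpleMax> b =
                new TypedSummary<SimpleMean, SimpleMax>(new SimpleMean(), new SimpleMax());
        a.addValue(1.0); a.addValue(2.0); a.addValue(3.0);
        b.addValue(4.0); b.addValue(5.0);
        a.aggregate(b);
        System.out.println("mean " + a.getMean() + ", max " + a.getMax());
    }
}
```

Two costs show up even in this reduced form. The summary class needs one type parameter per pluggable statistic, which would be a sizeable API change for current users. And the bounds only guarantee matching types for aggregation: nothing here stops a SimpleMax from being supplied as the mean implementation.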


Suggestions / patches welcome!

Phil


