Albretch Mueller wrote:
On Sun, Jun 8, 2008 at 10:18 AM, Phil Steitz <[EMAIL PROTECTED]>
It's probably best to take the discussion to the dev list.
~
Hi,
~
this thread started in [EMAIL PROTECTED] as
"commons.apache.org/math/stat/"
~
Formal need for a way to keep incremental statistics as part of the package:
~
If you do heavy data analysis/mining you will greatly benefit from
being able to keep the data statistics as part of the data itself and
not having to create stat.descriptive.DescriptiveStatistics objects
for each pattern (and I am not talking about hypothetical scenarios
here; I encounter such problems when parsing stats related to
linguistic patterns in large bodies of text, like those you find at
large text banks such as the one that the gutenberg.org project hosts).
~
When you have lots of patterns whose frequency distributions you are
interested in, you don't really want to internally "maintain datasets
of values for each of them and compute descriptive statistics based on
stored data"; you would rather keep a data structure that looks like
this:
~
class sdt {
    public String pattern;   // pattern
    public long lastOffset;  // offset of the most recent occurrence
    // __
    public int tmsFnd;       // times found
    public double mean;      // mean
    public double stdDev;    // standard deviation of the data distribution
    public double skew;      // skewness
}
~
 You would update these objects as needed, based on the latest stats
kept within the object and the newly found value (which could be,
e.g., the difference between the current offset and the previous one).
If you are willing to drag along UnivariateStatistics in the structures
above, you could accomplish this by calling their increment() methods. The
statistics (such as the ones above) that do not require that their
complete supporting datasets be stored are implemented in commons math
as "StorelessUnivariateStatistics". These objects expose increment()
methods that allow them to be updated based on new data values. See,
e.g. org.apache.commons.math.stat.descriptive.moment.Mean.
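To make the idea of keeping the statistics inside the per-pattern record concrete, here is a minimal, self-contained sketch. It does not use the commons-math classes; PatternStats, PatternIndex and found() are hypothetical names chosen for illustration, and the tracked value is assumed to be the gap between consecutive offsets, as suggested above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-pattern record, modeled on the "sdt" structure above. */
final class PatternStats {
    long lastOffset; // offset of the previous occurrence (valid once tmsFnd > 0)
    int tmsFnd;      // times found
    double mean;     // running mean of the gaps between consecutive occurrences

    /** Records an occurrence at the given offset and folds the gap into the mean. */
    void found(long offset) {
        if (tmsFnd > 0) {
            double gap = offset - lastOffset;
            // tmsFnd previous occurrences means this is gap number tmsFnd.
            mean += (gap - mean) / tmsFnd; // incremental mean, no list of gaps kept
        }
        lastOffset = offset;
        tmsFnd++;
    }
}

/** Hypothetical index holding one small record per pattern. */
final class PatternIndex {
    private final Map<String, PatternStats> stats = new HashMap<>();

    void found(String pattern, long offset) {
        stats.computeIfAbsent(pattern, p -> new PatternStats()).found(offset);
    }

    PatternStats get(String pattern) {
        return stats.get(pattern);
    }
}
```

Each pattern costs a few primitive fields, however many occurrences are seen.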
~
By the way, after studying a bit more the structure of the API I
would agree with you it should go in
"org.apache.commons.math.stat.StatUtils"
Yes. I can see the usefulness of this for cases where the current API
is too heavy. What probably makes sense is individual update methods
for common statistics that admit this.
~
 Here is the deceptively simple math it entails; I will need to use
some ascii "art" here:
~
the mean for the variable X is defined as:
~
 Mean(X, N) := Mean(xi, i in [1, N]) := (x1 + x2 + x3 + . . . + xN)/N
~
 Now, when the new (N + 1)-th value arrives, the mean becomes
~
 Mean(X, N + 1) := (x1 + x2 + x3 + . . . + xN + x(N+1))/(N + 1)
~
 Playing with it a bit algebraically we get:
~
 (N + 1) * Mean(X, N + 1) = x1 + x2 + x3 + . . . + xN + x(N+1)
~
 (N + 1) * Mean(X, N + 1) = N * Mean(X, N) + x(N+1)
~
 So we can naturally (it is a "simple" additive induction) express
the new mean as a function of the old mean and the new value:
~
 I) Mean(X, N + 1) = (N/(N + 1)) * Mean(X, N) + x(N+1)/(N + 1)
~
So for I) you will need
~
1) "N"
 2) new value x(N+1)
3) the old Mean
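A minimal sketch of update I) in Java, assuming a hypothetical class RunningMean (not part of commons-math). It uses the algebraically equivalent form Mean + (x - Mean)/(N + 1), which avoids multiplying the old mean by N/(N + 1):

```java
/** Hypothetical helper illustrating update I); not a commons-math class. */
final class RunningMean {
    private long n;      // number of values seen so far (N)
    private double mean; // Mean(X, N); 0.0 while no value has been seen

    /** Folds the new value x(N+1) into the mean. */
    void add(double x) {
        // I)  Mean(X, N+1) = (N/(N+1)) * Mean(X, N) + x/(N+1)
        //                  = Mean(X, N) + (x - Mean(X, N)) / (N+1)
        n++;
        mean += (x - mean) / n;
    }

    long count() {
        return n;
    }

    double mean() {
        return mean;
    }
}
```

Only N and the old mean are stored; none of the earlier values are kept.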
~
 For the standard deviation the derivation is so similar that, having
shown it for the mean, it doesn't really need to be spelled out; you
will then need:
~
1) "N"
 2) new value x(N+1)
3) the old Mean
4) the old std Dev
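As a sketch of how those four inputs suffice, the class below rebuilds the sum of squared deviations from the old standard deviation, applies a Welford-style update, and converts back. The class name RunningStdDev is hypothetical, and the sample definition (divisor N - 1, with 0 reported for fewer than two values) is an assumption, not something fixed by the discussion above.

```java
/** Hypothetical helper: updates mean and sample std dev from N, x, old mean, old std dev. */
final class RunningStdDev {
    private long n;        // number of values seen so far (N)
    private double mean;   // old mean
    private double stdDev; // old sample standard deviation (divisor N - 1)

    void add(double x) {
        // Recover the sum of squared deviations from the old std dev.
        double m2 = (n > 1) ? (n - 1) * stdDev * stdDev : 0.0;
        double delta = x - mean;  // deviation from the old mean
        n++;
        mean += delta / n;        // same update as for the mean
        m2 += delta * (x - mean); // Welford-style: old deviation times new deviation
        stdDev = (n > 1) ? Math.sqrt(m2 / (n - 1)) : 0.0;
    }

    long count() {
        return n;
    }

    double mean() {
        return mean;
    }

    double stdDev() {
        return stdDev;
    }
}
```

Storing the sum of squared deviations directly, instead of the standard deviation, would avoid the square root and re-squaring on every update; that is a design choice, not a requirement.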
~
and for the Skewness you will need:
~
1) "N"
 2) new value x(N+1)
3) the old Mean
4) the old std Dev
5) the old Skewness
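A sketch of the skewness case follows. Instead of storing the standard deviation and skewness themselves, it stores the sums of squared and cubed deviations (m2 and m3), which carry the same information once N is known. The class name RunningSkewness is hypothetical, and the skewness reported is assumed to be the plain moment ratio sqrt(N) * m3 / m2^1.5; a library may well use a bias-corrected definition instead.

```java
/** Hypothetical helper: one-pass mean, std dev and skewness from running moment sums. */
final class RunningSkewness {
    private long n;      // number of values seen so far
    private double mean; // running mean
    private double m2;   // sum of (x - mean)^2
    private double m3;   // sum of (x - mean)^3

    void add(double x) {
        long n1 = n;                 // count before this value
        n++;
        double delta = x - mean;     // deviation from the old mean
        double deltaN = delta / n;
        double term1 = delta * deltaN * n1;
        mean += deltaN;
        // m3 must be updated before m2, because it needs the old m2.
        m3 += term1 * deltaN * (n - 2) - 3.0 * deltaN * m2;
        m2 += term1;
    }

    double mean() {
        return mean;
    }

    /** Sample standard deviation (divisor n - 1); 0 for fewer than two values. */
    double stdDev() {
        return (n > 1) ? Math.sqrt(m2 / (n - 1)) : 0.0;
    }

    /** Moment-ratio skewness; NaN when all values seen so far are equal. */
    double skewness() {
        return (m2 > 0.0) ? Math.sqrt((double) n) * m3 / Math.pow(m2, 1.5) : Double.NaN;
    }
}
```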
~
 I know that just looks nice ;-), but computers are not good at
figuring out what to do with all the rounding errors that will
certainly appear as a byproduct of all this "simple"-looking math. I am
sure you must be using some kind of magic in order to offset those.
. . . Patches are always welcome.
~
 Let me know how I can help you (/commons.apache.org/math/stat/). I
could code some actual Java, but I am not familiar with the guts of the
API/underlying framework, so you would decide if you want to invite me
as an upstream developer and mentor me initially or if you would just
be happy with my suggestion. By the way, I am a Mathematician (actually
a theoretical Physicist myself).
We are always happy to welcome new contributors to commons or any other
apache project. The best way to start working on commons math is to
check out the developers page:
http://commons.apache.org/math/developers.html. Follow directions there
to get set up with Subversion and Maven or Ant and JUnit to build the
code and run the tests. Then follow the instructions on submitting
patches through JIRA and you are off to the races :)
Please do not hesitate to ask here or mail me personally if you need
help getting set up to build and test the code.
Phil
~
See you
lbrtchx
~
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]