On Sun, Jun 8, 2008 at 10:18 AM, Phil Steitz <[EMAIL PROTECTED]> wrote:
> It's probably best to take the discussion to the dev list.

Hi,

this thread started in [EMAIL PROTECTED] as "commons.apache.org/math/stat/".

Formal need for a way to keep incremental statistics as part of the package:

If you do heavy data analysis/mining, you benefit greatly from keeping the statistics as part of the data itself, instead of creating a stat.descriptive.DescriptiveStatistics object for each pattern. I am not talking about hypothetical scenarios here: I run into this problem when parsing statistics on linguistic patterns in large bodies of text, like those found in large text banks such as the one hosted by the gutenberg.org project.

When you have lots of patterns whose frequency distributions you are interested in, you don't really want to internally "maintain datasets of values for each of them and compute descriptive statistics based on stored data". You would rather keep a data structure that looks like this:

    class sdt {
        public String pattern;   // pattern
        public long lastOffset;  // __
        public int tmsFnd;       // times found
        public double mean;      // mean
        public double stdDev;    // standard deviation of the data distribution
        public double skew;      // skewness
    }

and update each object as needed, based on the latest statistics kept within the object and the newly found value (which could be, e.g., the difference between the current offset and the previous one).

By the way, after studying the structure of the API a bit more, I agree with you that it should go in "org.apache.commons.math.stat.StatUtils".

Here is the deceptively naive math it entails; I will need to use some ASCII "art" here.

The mean of the variable X is defined as:

    Mean(X, N) := Mean(xi, i in [1, N]) := (x1 + x2 + x3 + . . . + xN)/N

Now, when the new (N + 1)-th value arrives, the mean becomes:

    Mean(X, N + 1) := Mean(xi, i in [1, N + 1]) := (x1 + x2 + x3 + . . . + xN + x(N+1))/(N + 1)

Playing with it algebraically a bit, we get:

    (N + 1) * Mean(X, N + 1) = x1 + x2 + x3 + . . . + xN + x(N+1)

    (N + 1) * Mean(X, N + 1) = N * Mean(X, N) + x(N+1)

So we can naturally (it is a "simple" additive induction) express the new mean as a function of the old mean and the new value:

    I) Mean(X, N + 1) = (N/(N + 1)) * Mean(X, N) + x(N+1)/(N + 1)

So for I) you will need:

    1) "N"
    2) the new value x(N+1)
    3) the old mean

For the std dev you proceed so similarly that, after explaining it for the mean, it doesn't really need to be rolled out. You will then need:

    1) "N"
    2) the new value x(N+1)
    3) the old mean
    4) the old std dev

and for the skewness you will need:

    1) "N"
    2) the new value x(N+1)
    3) the old mean
    4) the old std dev
    5) the old skewness

I know that just looks nice ;-), but computers are not good at dealing with all the rounding errors that will certainly appear as a byproduct of all this "simple"-looking math. I am sure you must be using some kind of magic to offset those.

> . . . Patches are always welcome.

Let me know how I can help you (/commons.apache.org/math/stat/). I could code some actual Java, but I am not familiar with the guts of the API/underlying framework, so you would have to decide whether you want to invite me as an upstream developer and mentor me initially, or whether you would just be happy with my suggestion. By the way, I am a mathematician (actually a theoretical physicist).

See you
lbrtchx
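
P.S. A minimal sketch of how update rule I) and its analogues for the std dev and the skewness might look in Java. Everything here is an assumption for illustration: the class name RunningStats and its methods are hypothetical and not part of the Commons Math API. Instead of storing the std dev and the skewness directly, the sketch keeps the running sums of squared and cubed deviations (m2, m3) and uses the well-known Welford-style update, which tends to behave better with respect to rounding errors. It also assumes the population definitions (dividing by N).

```java
/** Hypothetical sketch: running mean, std dev and skewness without storing the data. */
final class RunningStats {
    private long n;        // number of values seen so far ("N")
    private double mean;   // running mean (0.0 until the first value arrives)
    private double m2;     // sum of squared deviations from the mean
    private double m3;     // sum of cubed deviations from the mean

    /** Folds one new value x(N+1) into the statistics. */
    void add(double x) {
        long n1 = n;                        // old N
        n++;                                // new N + 1
        double delta = x - mean;            // distance of the new value from the old mean
        double deltaN = delta / n;
        double term1 = delta * deltaN * n1;
        mean += deltaN;                     // same as (N/(N+1)) * mean + x/(N+1)
        m3 += term1 * deltaN * (n - 2) - 3.0 * deltaN * m2;  // uses the old m2
        m2 += term1;
    }

    long count() {
        return n;
    }

    double mean() {
        return mean;
    }

    /** Population standard deviation (assumption: divide by N); NaN when empty. */
    double stdDev() {
        return n > 0 ? Math.sqrt(m2 / n) : Double.NaN;
    }

    /** Population skewness; NaN when empty or when all values are equal. */
    double skewness() {
        if (n == 0 || m2 == 0.0) {
            return Double.NaN;
        }
        return Math.sqrt((double) n) * m3 / Math.pow(m2, 1.5);
    }
}
```

One such object per pattern would be fed with add(...) each time the pattern is found, so no dataset of past values has to be kept. The std dev and the skewness can be read back from m2 and m3 at any time, which carries the same information as items 4) and 5) in the lists above.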