I propose a new module named Statistics::Descriptive::Discrete

Background:
I use the full version (Statistics::Descriptive::Full) of the
Statistics::Descriptive module quite a bit and it works
great.  However, because it stores a copy of all the data
in an array (and also needs to sort this array), the module
is quite slow for large datasets (> 1 million data values).
I frequently use data sets of this size or larger so I
devised a faster solution.

My solution:
Since most of the data I need to analyze is the
output of satellite telemetry it has all been through an
analog to digital conversion so there are a known, discrete
number of "levels" or different values present in the data.
For example, an 8 bit value would only have 2^8 possible
values even though there might be 3 million such data
points in my data set.

Instead of storing every value in an array, I only store the
particular values I've seen as hash keys then store the
number of times I've seen each  value as the hash value.
This lets me reconstruct the original data set  (but not the
original order) with very little overhead -- particularly
when  the number of discrete values is small compared to the
total number of data  points.

Performance:
The results are VERY good.  For a real-world
data set using 2.6 million 8-bit values,
Statistics::Descriptive::Full took 561 seconds to produce
results (min, max, mean, mode, median, standard deviation,
and frequency  distribution) and my module took 40 seconds
to produce the same results.   Futhermore,
Statistics::Descriptive::Full required 400MB of RAM and my
module required 3MB.  Statistics::Descriptive::Full scales
somewhat exponentially with number of data points whereas
my new module scales linearly.

Testing:
I've tested this module using quite a few data sets
and performance is very  good as long as the number of
discrete levels in the data is some fraction of the total
number of data points (more testing needed to find exactly what that 
fraction is).  Even with 2^20 possible levels, this
beats Statistics::Descriptive::Full as long as the number
of data points is high.

Naming:
Please note that this new module is NOT a
replacement for  Statistics::Descriptive (which is a very
good, well tested module) but it does perform much better
for certain data sets.  So, I think I should stay in the
Statistics::Descriptive namespace which is why I propose
something like Statistics::Descriptive::Discrete or
Statistics::Descriptive::Quantized.  Also note
that I've kept the interface identical to
Statistics::Descriptive as much as possible so this should
be a drop-in replacement for most purposes.  I've changed several of my 
scripts by simply changing the use statement and the call to new.

I've tried to contact the maintainer of Statistics::Descriptive (Colin 
Kuskie) at his PAUSE address as well as an address I found on usenet but 
both messages bounced.  Therefore, I haven't run this by him.

More info:
You can read my write-up over at perlmonks.org:
http://www.perlmonks.org/index.pl?lastnode_id=6364&node_id=146691

Or see my post with some sample code on the module-authors archive:
http:[EMAIL PROTECTED]/msg00304.html

I'd appreciate any input on the namespace for this module.  It's my first 
CPAN submission.

Regards, --Rhet Turnbull, RhetTbull_at_hotmail.com, CPAN ID: RHETTBULL



_________________________________________________________________
MSN Photos is the easiest way to share and print your photos: 
http://photos.msn.com/support/worldwide.aspx

Reply via email to