Re: [math] speeding up percentile based statistics

Phil Steitz Sat, 18 Sep 2010 08:36:51 -0700

On 9/18/10 10:24 AM, Luc Maisonobe wrote:

Le 18/09/2010 16:16, Phil Steitz a écrit :

On 9/17/10 3:19 PM, Luc Maisonobe wrote:

Le 17/09/2010 19:55, Ted Dunning a écrit :

There are also on-line percentile estimation methods that require only a
single pass over the data in order to get good estimates of several
quantile
at the same time.


If you are interested in getting an estimate of some quantile of the
underlying distribution that generated your data then these on-line
methods
will give you an estimate that is very nearly as accurate as sorting
your
sample.  Sorting gives you the exact quantiles of your sample, but
only an
estimate of the quantiles of your underlying distribution.

There is a simplified implementation of this in Mahout along with
test cases
that demonstrate reasonable accuracy.

These techniques are well described in the article: *Incremental
Quantile
Estimation for Massive
Tracking<http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580>

*
*
*
*(available at
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580 )*


Thanks for the pointer, it looks interesting.

*
*
*Regarding the incremental selection method to find multiple
quantiles, I
think that you save a little bit if you are looking for a few
quantiles, but
the added complexity and special cases are going to make testing
difficult.
   Wouldn't it be better to use one of the following strategies:*
*
*
*- keep a copy of the sorted data and if that copy is available, just
use
it.  This cuts the cost of 100 quantiles probed incrementally to a
single
sort.


This is exactly what this user did: put the sort out of the loop.

Moreover, since sort is n log n without a radix sort, then the
increment selection algorithm can only win if 100<   log n which is
pretty
unlikely.*


I don't understand. Partition-based selection algorithms are basically
partition-based sort algorithm where we recurse in only one of the
partition once the pivot has been chosen. Subsequent calls therefore
don't restart from an array of size n but from smaller sub-arrays, has
the pivot can be saved (at least the top level ones). If at the end the
number of selections is so high the array ends up to be completely
sorted, the total cost is probably not much higher than what an initial
sort would have done. It will be higher since their are some bookkeeping
to do, but not so much higher I think. Doing one call corresponds to
resuming the partial sort from the state resulting from previous calls.

Did I miss something ?

*
*
*- add a method to probe multiple quantiles at the same time.  This
potentially uses less memory than the first approach, but is
dependent on
the user calling the right method.*


Since there are different use case, having several methods seems fair. A
multiple quantiles method is a good idea.

*
*
*- use the on-line algorithm mentioned above with two passes, one to
estimate the quantiles of interest, a second to refine the estimate
using
the actual data.  This allows the multiple probe method to be linear
in time
and should give you exact sample quantiles.  It doesn't help the
repeated
probe problem.*


I'm not sure I understand well. However, the on-line method by itself
would be an interesting addition. It would allow quantiles computation
with the Storeless versions of our classes.


+1 - definitely valuable for the storeless case.

I need to think about this some more / see some patches; but my initial
reaction is that we can get a big bang from just doing what the user did
(caching the sorted array and inserting into it when addValue is called)
for the data-in-memory impls.  I would suggest implementing that for for
Percentile itself (taking care to handle addValue and rolling windows
correctly) and DescriptiveStatistics;


This would however handle only the multiple quantiles use case, not the
case where only one quantile is computed that could be provided with a
selection in O(n) rather than a sort in O(n ln(n)).

Correct. Ideal would be to do the sort incrementally as youdescribe for the data-in-memory impl. Would be tricky to implementin Percentile, given array index parameters and the need to handleaddValue and window in DescriptiveStatistics; but this should be doable.

Lets open 2 JIRAs for this - one straight performance improvementfor Percentile and the other for a storeless percentile estimator.


Phil

Luc

but add a new (storelesss)
statistic based on an incremental quantile estimation algorithm and make
this accessible via the storeless aggregate, SummaryStatistics.  It
might be interesting to include this optionally with
DescriptiveStatistics as well, so that in the rolling window case you
could compare the current window distribution to the full sample.

Phil

Luc

*
*
On Fri, Sep 17, 2010 at 10:34 AM, Luc
Maisonobe<luc.maison...@free.fr>wrote:

Hi all,

During a recent face to face discussion with some commons-math users
from a large project, it appeared one implementation choice for
Percentile statistics was a huge performance bottleneck.

When the Percentile.evaluate method is called, the sample array is
copied and sorted. In one use case, 100 calls were made for each
sample,
so the same array was sorted 100 times. In another use case, only one
call is made (typically to compute the median) but the user wondered
why
the array should be completely sorted. In fact, using a selection
algorithm instead of a sorting algorithm would be sufficient.

I would like to have you opinion about providing a different evaluate
method that would process an array provided previously and use only
selections to provide the percentile. Consider for example the
following
input array:

   [ -6, 8, 4, -5, 0, 2, 7 ]

If we first as for the median, a selection algorithm may reorganize the
array as:

   [ -6, 0, -5, 2, 8, 4, 7 ]

were the left part is smaller than the central element, the right part
is larger and hence the central element 2 is known to be the median.

Then we ask for another value, the 25% percentile. Since the object
already knows the smaller elements are in the left part, it can use a
select algorithm on the 3 elements left part and extract the value -5
without even trying to sort the upper part of the array.

What do you think ?

Luc

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

Re: [math] speeding up percentile based statistics

Reply via email to