There are also on-line percentile estimation methods that require only a
single pass over the data in order to get good estimates of several quantile
at the same time.

If you are interested in getting an estimate of some quantile of the
underlying distribution that generated your data then these on-line methods
will give you an estimate that is very nearly as accurate as sorting your
sample.  Sorting gives you the exact quantiles of your sample, but only an
estimate of the quantiles of your underlying distribution.

There is a simplified implementation of this in Mahout along with test cases
that demonstrate reasonable accuracy.

These techniques are well described in the article: *Incremental Quantile
Estimation for Massive
Tracking<http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580>
*
*
*
*(available at
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580 )*
*
*
*Regarding the incremental selection method to find multiple quantiles, I
think that you save a little bit if you are looking for a few quantiles, but
the added complexity and special cases are going to make testing difficult.
 Wouldn't it be better to use one of the following strategies:*
*
*
*- keep a copy of the sorted data and if that copy is available, just use
it.  This cuts the cost of 100 quantiles probed incrementally to a single
sort.  Moreover, since sort is n log n without a radix sort, then the
increment selection algorithm can only win if 100 < log n which is pretty
unlikely.*
*
*
*- add a method to probe multiple quantiles at the same time.  This
potentially uses less memory than the first approach, but is dependent on
the user calling the right method.*
*
*
*- use the on-line algorithm mentioned above with two passes, one to
estimate the quantiles of interest, a second to refine the estimate using
the actual data.  This allows the multiple probe method to be linear in time
and should give you exact sample quantiles.  It doesn't help the repeated
probe problem.*
*
*
On Fri, Sep 17, 2010 at 10:34 AM, Luc Maisonobe <luc.maison...@free.fr>wrote:

> Hi all,
>
> During a recent face to face discussion with some commons-math users
> from a large project, it appeared one implementation choice for
> Percentile statistics was a huge performance bottleneck.
>
> When the Percentile.evaluate method is called, the sample array is
> copied and sorted. In one use case, 100 calls were made for each sample,
> so the same array was sorted 100 times. In another use case, only one
> call is made (typically to compute the median) but the user wondered why
> the array should be completely sorted. In fact, using a selection
> algorithm instead of a sorting algorithm would be sufficient.
>
> I would like to have you opinion about providing a different evaluate
> method that would process an array provided previously and use only
> selections to provide the percentile. Consider for example the following
> input array:
>
>  [ -6, 8, 4, -5, 0, 2, 7 ]
>
> If we first as for the median, a selection algorithm may reorganize the
> array as:
>
>  [ -6, 0, -5, 2, 8, 4, 7 ]
>
> were the left part is smaller than the central element, the right part
> is larger and hence the central element 2 is known to be the median.
>
> Then we ask for another value, the 25% percentile. Since the object
> already knows the smaller elements are in the left part, it can use a
> select algorithm on the 3 elements left part and extract the value -5
> without even trying to sort the upper part of the array.
>
> What do you think ?
>
> Luc
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
>
>

Reply via email to