Re: [math] speeding up percentile based statistics

Luc Maisonobe Sat, 18 Sep 2010 07:25:23 -0700

Le 18/09/2010 16:16, Phil Steitz a écrit :
> On 9/17/10 3:19 PM, Luc Maisonobe wrote:
>> Le 17/09/2010 19:55, Ted Dunning a écrit :
>>> There are also on-line percentile estimation methods that require only a
>>> single pass over the data in order to get good estimates of several
>>> quantile
>>> at the same time.
>>>
>>> If you are interested in getting an estimate of some quantile of the
>>> underlying distribution that generated your data then these on-line
>>> methods
>>> will give you an estimate that is very nearly as accurate as sorting
>>> your
>>> sample.  Sorting gives you the exact quantiles of your sample, but
>>> only an
>>> estimate of the quantiles of your underlying distribution.
>>>
>>> There is a simplified implementation of this in Mahout along with
>>> test cases
>>> that demonstrate reasonable accuracy.
>>>
>>> These techniques are well described in the article: *Incremental
>>> Quantile
>>> Estimation for Massive
>>> Tracking<http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580>
>>>
>>> *
>>> *
>>> *
>>> *(available at
>>> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580 )*
>>
>> Thanks for the pointer, it looks interesting.
>>
>>> *
>>> *
>>> *Regarding the incremental selection method to find multiple
>>> quantiles, I
>>> think that you save a little bit if you are looking for a few
>>> quantiles, but
>>> the added complexity and special cases are going to make testing
>>> difficult.
>>>   Wouldn't it be better to use one of the following strategies:*
>>> *
>>> *
>>> *- keep a copy of the sorted data and if that copy is available, just
>>> use
>>> it.  This cuts the cost of 100 quantiles probed incrementally to a
>>> single
>>> sort.
>>
>> This is exactly what this user did: put the sort out of the loop.
>>
>>> Moreover, since sort is n log n without a radix sort, then the
>>> increment selection algorithm can only win if 100<  log n which is
>>> pretty
>>> unlikely.*
>>
>> I don't understand. Partition-based selection algorithms are basically
>> partition-based sort algorithm where we recurse in only one of the
>> partition once the pivot has been chosen. Subsequent calls therefore
>> don't restart from an array of size n but from smaller sub-arrays, has
>> the pivot can be saved (at least the top level ones). If at the end the
>> number of selections is so high the array ends up to be completely
>> sorted, the total cost is probably not much higher than what an initial
>> sort would have done. It will be higher since their are some bookkeeping
>> to do, but not so much higher I think. Doing one call corresponds to
>> resuming the partial sort from the state resulting from previous calls.
>>
>> Did I miss something ?
>>
>>
>>> *
>>> *
>>> *- add a method to probe multiple quantiles at the same time.  This
>>> potentially uses less memory than the first approach, but is
>>> dependent on
>>> the user calling the right method.*
>>
>> Since there are different use case, having several methods seems fair. A
>> multiple quantiles method is a good idea.
>>
>>> *
>>> *
>>> *- use the on-line algorithm mentioned above with two passes, one to
>>> estimate the quantiles of interest, a second to refine the estimate
>>> using
>>> the actual data.  This allows the multiple probe method to be linear
>>> in time
>>> and should give you exact sample quantiles.  It doesn't help the
>>> repeated
>>> probe problem.*
>>
>> I'm not sure I understand well. However, the on-line method by itself
>> would be an interesting addition. It would allow quantiles computation
>> with the Storeless versions of our classes.
> 
> +1 - definitely valuable for the storeless case.
> 
> I need to think about this some more / see some patches; but my initial
> reaction is that we can get a big bang from just doing what the user did
> (caching the sorted array and inserting into it when addValue is called)
> for the data-in-memory impls.  I would suggest implementing that for for
> Percentile itself (taking care to handle addValue and rolling windows
> correctly) and DescriptiveStatistics;


This would however handle only the multiple quantiles use case, not the
case where only one quantile is computed that could be provided with a
selection in O(n) rather than a sort in O(n ln(n)).

Luc

> but add a new (storelesss)
> statistic based on an incremental quantile estimation algorithm and make
> this accessible via the storeless aggregate, SummaryStatistics.  It
> might be interesting to include this optionally with
> DescriptiveStatistics as well, so that in the rolling window case you
> could compare the current window distribution to the full sample.
> 
> Phil
> 
> 
>>
>> Luc
>>
>>> *
>>> *
>>> On Fri, Sep 17, 2010 at 10:34 AM, Luc
>>> Maisonobe<luc.maison...@free.fr>wrote:
>>>
>>>> Hi all,
>>>>
>>>> During a recent face to face discussion with some commons-math users
>>>> from a large project, it appeared one implementation choice for
>>>> Percentile statistics was a huge performance bottleneck.
>>>>
>>>> When the Percentile.evaluate method is called, the sample array is
>>>> copied and sorted. In one use case, 100 calls were made for each
>>>> sample,
>>>> so the same array was sorted 100 times. In another use case, only one
>>>> call is made (typically to compute the median) but the user wondered
>>>> why
>>>> the array should be completely sorted. In fact, using a selection
>>>> algorithm instead of a sorting algorithm would be sufficient.
>>>>
>>>> I would like to have you opinion about providing a different evaluate
>>>> method that would process an array provided previously and use only
>>>> selections to provide the percentile. Consider for example the
>>>> following
>>>> input array:
>>>>
>>>>   [ -6, 8, 4, -5, 0, 2, 7 ]
>>>>
>>>> If we first as for the median, a selection algorithm may reorganize the
>>>> array as:
>>>>
>>>>   [ -6, 0, -5, 2, 8, 4, 7 ]
>>>>
>>>> were the left part is smaller than the central element, the right part
>>>> is larger and hence the central element 2 is known to be the median.
>>>>
>>>> Then we ask for another value, the 25% percentile. Since the object
>>>> already knows the smaller elements are in the left part, it can use a
>>>> select algorithm on the 3 elements left part and extract the value -5
>>>> without even trying to sort the upper part of the array.
>>>>
>>>> What do you think ?
>>>>
>>>> Luc
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>>>> For additional commands, e-mail: dev-h...@commons.apache.org
>>>>
>>>>
>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>> For additional commands, e-mail: dev-h...@commons.apache.org
>>
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

Re: [math] speeding up percentile based statistics

Reply via email to