Re: [math] speeding up percentile based statistics

Luc Maisonobe Sat, 18 Sep 2010 08:40:14 -0700

Le 18/09/2010 17:36, Phil Steitz a écrit :
> On 9/18/10 10:24 AM, Luc Maisonobe wrote:
>> Le 18/09/2010 16:16, Phil Steitz a écrit :
>>> On 9/17/10 3:19 PM, Luc Maisonobe wrote:
>>>> Le 17/09/2010 19:55, Ted Dunning a écrit :
>>>>> There are also on-line percentile estimation methods that require
>>>>> only a
>>>>> single pass over the data in order to get good estimates of several
>>>>> quantile
>>>>> at the same time.
>>>>>
>>>>> If you are interested in getting an estimate of some quantile of the
>>>>> underlying distribution that generated your data then these on-line
>>>>> methods
>>>>> will give you an estimate that is very nearly as accurate as sorting
>>>>> your
>>>>> sample.  Sorting gives you the exact quantiles of your sample, but
>>>>> only an
>>>>> estimate of the quantiles of your underlying distribution.
>>>>>
>>>>> There is a simplified implementation of this in Mahout along with
>>>>> test cases
>>>>> that demonstrate reasonable accuracy.
>>>>>
>>>>> These techniques are well described in the article: *Incremental
>>>>> Quantile
>>>>> Estimation for Massive
>>>>> Tracking<http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580>
>>>>>
>>>>>
>>>>> *
>>>>> *
>>>>> *
>>>>> *(available at
>>>>> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580 )*
>>>>
>>>> Thanks for the pointer, it looks interesting.
>>>>
>>>>> *
>>>>> *
>>>>> *Regarding the incremental selection method to find multiple
>>>>> quantiles, I
>>>>> think that you save a little bit if you are looking for a few
>>>>> quantiles, but
>>>>> the added complexity and special cases are going to make testing
>>>>> difficult.
>>>>>    Wouldn't it be better to use one of the following strategies:*
>>>>> *
>>>>> *
>>>>> *- keep a copy of the sorted data and if that copy is available, just
>>>>> use
>>>>> it.  This cuts the cost of 100 quantiles probed incrementally to a
>>>>> single
>>>>> sort.
>>>>
>>>> This is exactly what this user did: put the sort out of the loop.
>>>>
>>>>> Moreover, since sort is n log n without a radix sort, then the
>>>>> increment selection algorithm can only win if 100<   log n which is
>>>>> pretty
>>>>> unlikely.*
>>>>
>>>> I don't understand. Partition-based selection algorithms are basically
>>>> partition-based sort algorithm where we recurse in only one of the
>>>> partition once the pivot has been chosen. Subsequent calls therefore
>>>> don't restart from an array of size n but from smaller sub-arrays, has
>>>> the pivot can be saved (at least the top level ones). If at the end the
>>>> number of selections is so high the array ends up to be completely
>>>> sorted, the total cost is probably not much higher than what an initial
>>>> sort would have done. It will be higher since their are some
>>>> bookkeeping
>>>> to do, but not so much higher I think. Doing one call corresponds to
>>>> resuming the partial sort from the state resulting from previous calls.
>>>>
>>>> Did I miss something ?
>>>>
>>>>
>>>>> *
>>>>> *
>>>>> *- add a method to probe multiple quantiles at the same time.  This
>>>>> potentially uses less memory than the first approach, but is
>>>>> dependent on
>>>>> the user calling the right method.*
>>>>
>>>> Since there are different use case, having several methods seems
>>>> fair. A
>>>> multiple quantiles method is a good idea.
>>>>
>>>>> *
>>>>> *
>>>>> *- use the on-line algorithm mentioned above with two passes, one to
>>>>> estimate the quantiles of interest, a second to refine the estimate
>>>>> using
>>>>> the actual data.  This allows the multiple probe method to be linear
>>>>> in time
>>>>> and should give you exact sample quantiles.  It doesn't help the
>>>>> repeated
>>>>> probe problem.*
>>>>
>>>> I'm not sure I understand well. However, the on-line method by itself
>>>> would be an interesting addition. It would allow quantiles computation
>>>> with the Storeless versions of our classes.
>>>
>>> +1 - definitely valuable for the storeless case.
>>>
>>> I need to think about this some more / see some patches; but my initial
>>> reaction is that we can get a big bang from just doing what the user did
>>> (caching the sorted array and inserting into it when addValue is called)
>>> for the data-in-memory impls.  I would suggest implementing that for for
>>> Percentile itself (taking care to handle addValue and rolling windows
>>> correctly) and DescriptiveStatistics;
>>
>> This would however handle only the multiple quantiles use case, not the
>> case where only one quantile is computed that could be provided with a
>> selection in O(n) rather than a sort in O(n ln(n)).
>>
> 
> Correct.  Ideal would be to do the sort incrementally as you describe
> for the data-in-memory impl.  Would be tricky to implement in
> Percentile, given array index parameters and the need to handle addValue
> and window in DescriptiveStatistics; but this should be doable.
> 
> Lets open 2 JIRAs for this - one straight performance improvement for
> Percentile and the other for a storeless percentile estimator.


OK. I'll open them in a few minutes.

Luc

> 
> Phil
> 
> 
>> Luc
>>
>>> but add a new (storelesss)
>>> statistic based on an incremental quantile estimation algorithm and make
>>> this accessible via the storeless aggregate, SummaryStatistics.  It
>>> might be interesting to include this optionally with
>>> DescriptiveStatistics as well, so that in the rolling window case you
>>> could compare the current window distribution to the full sample.
>>>
>>> Phil
>>>
>>>
>>>>
>>>> Luc
>>>>
>>>>> *
>>>>> *
>>>>> On Fri, Sep 17, 2010 at 10:34 AM, Luc
>>>>> Maisonobe<luc.maison...@free.fr>wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> During a recent face to face discussion with some commons-math users
>>>>>> from a large project, it appeared one implementation choice for
>>>>>> Percentile statistics was a huge performance bottleneck.
>>>>>>
>>>>>> When the Percentile.evaluate method is called, the sample array is
>>>>>> copied and sorted. In one use case, 100 calls were made for each
>>>>>> sample,
>>>>>> so the same array was sorted 100 times. In another use case, only one
>>>>>> call is made (typically to compute the median) but the user wondered
>>>>>> why
>>>>>> the array should be completely sorted. In fact, using a selection
>>>>>> algorithm instead of a sorting algorithm would be sufficient.
>>>>>>
>>>>>> I would like to have you opinion about providing a different evaluate
>>>>>> method that would process an array provided previously and use only
>>>>>> selections to provide the percentile. Consider for example the
>>>>>> following
>>>>>> input array:
>>>>>>
>>>>>>    [ -6, 8, 4, -5, 0, 2, 7 ]
>>>>>>
>>>>>> If we first as for the median, a selection algorithm may
>>>>>> reorganize the
>>>>>> array as:
>>>>>>
>>>>>>    [ -6, 0, -5, 2, 8, 4, 7 ]
>>>>>>
>>>>>> were the left part is smaller than the central element, the right
>>>>>> part
>>>>>> is larger and hence the central element 2 is known to be the median.
>>>>>>
>>>>>> Then we ask for another value, the 25% percentile. Since the object
>>>>>> already knows the smaller elements are in the left part, it can use a
>>>>>> select algorithm on the 3 elements left part and extract the value -5
>>>>>> without even trying to sort the upper part of the array.
>>>>>>
>>>>>> What do you think ?
>>>>>>
>>>>>> Luc
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>>>>>> For additional commands, e-mail: dev-h...@commons.apache.org
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>>>> For additional commands, e-mail: dev-h...@commons.apache.org
>>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>>> For additional commands, e-mail: dev-h...@commons.apache.org
>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>> For additional commands, e-mail: dev-h...@commons.apache.org
>>
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

Re: [math] speeding up percentile based statistics

Reply via email to