On 18/09/2010 17:39, Luc Maisonobe wrote:
> On 18/09/2010 17:36, Phil Steitz wrote:
>> On 9/18/10 10:24 AM, Luc Maisonobe wrote:
>>> On 18/09/2010 16:16, Phil Steitz wrote:
>>>> On 9/17/10 3:19 PM, Luc Maisonobe wrote:
>>>>> On 17/09/2010 19:55, Ted Dunning wrote:
>>>>>> There are also on-line percentile estimation methods that require only a single pass over the data in order to get good estimates of several quantiles at the same time.
>>>>>>
>>>>>> If you are interested in getting an estimate of some quantile of the underlying distribution that generated your data, then these on-line methods will give you an estimate that is very nearly as accurate as sorting your sample. Sorting gives you the exact quantiles of your sample, but only an estimate of the quantiles of your underlying distribution.
>>>>>>
>>>>>> There is a simplified implementation of this in Mahout along with test cases that demonstrate reasonable accuracy.
>>>>>>
>>>>>> These techniques are well described in the article "Incremental Quantile Estimation for Massive Tracking" (available at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580 ).
>>>>>
>>>>> Thanks for the pointer, it looks interesting.
>>>>>
>>>>>> Regarding the incremental selection method to find multiple quantiles, I think that you save a little bit if you are looking for a few quantiles, but the added complexity and special cases are going to make testing difficult. Wouldn't it be better to use one of the following strategies:
>>>>>>
>>>>>> - keep a copy of the sorted data and, if that copy is available, just use it. This cuts the cost of 100 quantiles probed incrementally to a single sort.
>>>>>
>>>>> This is exactly what this user did: put the sort out of the loop.
>>>>>
>>>>>> Moreover, since sort is n log n without a radix sort, the incremental selection algorithm can only win if 100 < log n, which is pretty unlikely.
>>>>>
>>>>> I don't understand. Partition-based selection algorithms are basically partition-based sort algorithms where we recurse into only one of the partitions once the pivot has been chosen. Subsequent calls therefore don't restart from an array of size n but from smaller sub-arrays, as the pivots can be saved (at least the top-level ones). If in the end the number of selections is so high that the array ends up completely sorted, the total cost is probably not much higher than what an initial sort would have done. It will be higher, since there is some bookkeeping to do, but not much higher I think. Each call corresponds to resuming the partial sort from the state left by the previous calls.
>>>>>
>>>>> Did I miss something?
>>>>>
>>>>>> - add a method to probe multiple quantiles at the same time. This potentially uses less memory than the first approach, but is dependent on the user calling the right method.
>>>>>
>>>>> Since there are different use cases, having several methods seems fair. A multiple quantiles method is a good idea.
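As a rough sketch of the pivot-caching selection described above (hypothetical class and method names, not the commons-math API, and not meant as a definitive implementation), the idea of resuming a partial sort across calls could look like this in Java:

import java.util.TreeSet;

/**
 * Sketch: selection with cached pivots. Each select(k) call only partitions
 * the smallest segment between two pivots already known to be in their final
 * sorted position, so repeated quantile queries resume the partial sort
 * instead of restarting from the full array.
 */
public class CachedSelection {

    private final double[] data;
    // Indices whose elements are already in their final (sorted) position.
    private final TreeSet<Integer> pivots = new TreeSet<Integer>();

    public CachedSelection(double[] values) {
        this.data = values.clone();
        pivots.add(-1);            // sentinel below the array
        pivots.add(values.length); // sentinel above the array
    }

    /** Returns the element that would sit at index k if the array were fully sorted. */
    public double select(int k) {
        int lo = pivots.floor(k) + 1;   // start of the enclosing unsorted segment
        int hi = pivots.ceiling(k) - 1; // end of the enclosing unsorted segment
        while (lo < hi) {
            int p = partition(lo, hi);
            pivots.add(p);              // remember this pivot for later calls
            if (p == k) {
                return data[k];
            } else if (p < k) {
                lo = p + 1;
            } else {
                hi = p - 1;
            }
        }
        return data[k];
    }

    /** Lomuto partition of [lo, hi] around its last element; returns the pivot's final index. */
    private int partition(int lo, int hi) {
        double pivot = data[hi];
        int i = lo;
        for (int j = lo; j < hi; j++) {
            if (data[j] < pivot) {
                swap(i++, j);
            }
        }
        swap(i, hi);
        return i;
    }

    private void swap(int i, int j) {
        double t = data[i];
        data[i] = data[j];
        data[j] = t;
    }
}

On the example array used later in this thread, [ -6, 8, 4, -5, 0, 2, 7 ], select(3) returns the median 2, and a subsequent select(1) only partitions the small left segment to return -5.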
>>>>>>
>>>>>> - use the on-line algorithm mentioned above with two passes, one to estimate the quantiles of interest, and a second to refine the estimate using the actual data. This allows the multiple probe method to be linear in time and should give you exact sample quantiles. It doesn't help the repeated probe problem.
>>>>>
>>>>> I'm not sure I understand this well. However, the on-line method by itself would be an interesting addition. It would allow quantile computation with the storeless versions of our classes.
>>>>
>>>> +1 - definitely valuable for the storeless case.
>>>>
>>>> I need to think about this some more / see some patches; but my initial reaction is that we can get a big bang from just doing what the user did (caching the sorted array and inserting into it when addValue is called) for the data-in-memory impls. I would suggest implementing that for Percentile itself (taking care to handle addValue and rolling windows correctly) and DescriptiveStatistics;
>>>
>>> This would however handle only the multiple quantiles use case, not the case where only one quantile is computed, which could be provided with a selection in O(n) rather than a sort in O(n ln(n)).
>>>
>> Correct. Ideal would be to do the sort incrementally as you describe for the data-in-memory impl. It would be tricky to implement in Percentile, given the array index parameters and the need to handle addValue and the window in DescriptiveStatistics, but this should be doable.
>>
>> Let's open 2 JIRAs for this - one straight performance improvement for Percentile and the other for a storeless percentile estimator.
>
> OK. I'll open them in a few minutes.
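For the data-in-memory case, a minimal sketch of the "cache the sorted array and insert on addValue" idea discussed above might look like the following (hypothetical class, nearest-rank percentile definition for brevity, no rolling-window handling, and not the actual Percentile/DescriptiveStatistics API):

import java.util.Arrays;

/**
 * Sketch: keep a sorted copy of the sample and maintain it by binary
 * insertion on each addValue, so repeated percentile queries no longer
 * pay an O(n log n) sort each time.
 */
public class SortedSamplePercentile {

    private double[] sorted = new double[0];

    /** Inserts a value into the sorted cache: O(log n) search plus O(n) shift. */
    public void addValue(double v) {
        int pos = Arrays.binarySearch(sorted, v);
        if (pos < 0) {
            pos = -pos - 1; // insertion point when the value is not already present
        }
        double[] next = new double[sorted.length + 1];
        System.arraycopy(sorted, 0, next, 0, pos);
        next[pos] = v;
        System.arraycopy(sorted, pos, next, pos + 1, sorted.length - pos);
        sorted = next;
    }

    /** Nearest-rank percentile read directly from the cached sorted array. */
    public double getPercentile(double p) {
        if (sorted.length == 0) {
            throw new IllegalStateException("no data");
        }
        int index = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(0, Math.min(index, sorted.length - 1))];
    }
}

Each query is then O(1) once the cache is built; the trade-off, as noted above, is that this only helps the repeated-probe case, not a single percentile computed on a fresh array.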
The issues are <https://issues.apache.org/jira/browse/MATH-417> and <https://issues.apache.org/jira/browse/MATH-418>. I have set the target version to 2.2 for the first one and to 3.0 for the second one. I'll take a look at the first one after finishing my current work (implementing some new random generators).

Luc

>
> Luc
>
>>
>> Phil
>>
>>> Luc
>>>
>>>> but add a new (storeless) statistic based on an incremental quantile estimation algorithm and make this accessible via the storeless aggregate, SummaryStatistics. It might be interesting to include this optionally with DescriptiveStatistics as well, so that in the rolling window case you could compare the current window distribution to the full sample.
>>>>
>>>> Phil
>>>>
>>>>> Luc
>>>>>
>>>>>> On Fri, Sep 17, 2010 at 10:34 AM, Luc Maisonobe <luc.maison...@free.fr> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> During a recent face-to-face discussion with some commons-math users from a large project, it appeared that one implementation choice for the Percentile statistic was a huge performance bottleneck.
>>>>>>>
>>>>>>> When the Percentile.evaluate method is called, the sample array is copied and sorted. In one use case, 100 calls were made for each sample, so the same array was sorted 100 times. In another use case, only one call is made (typically to compute the median), but the user wondered why the array should be completely sorted. In fact, using a selection algorithm instead of a sorting algorithm would be sufficient.
>>>>>>>
>>>>>>> I would like to have your opinion about providing a different evaluate method that would process an array provided previously and use only selections to provide the percentile. Consider for example the following input array:
>>>>>>>
>>>>>>> [ -6, 8, 4, -5, 0, 2, 7 ]
>>>>>>>
>>>>>>> If we first ask for the median, a selection algorithm may reorganize the array as:
>>>>>>>
>>>>>>> [ -6, 0, -5, 2, 8, 4, 7 ]
>>>>>>>
>>>>>>> where the left part is smaller than the central element, the right part is larger, and hence the central element 2 is known to be the median.
>>>>>>>
>>>>>>> Then we ask for another value, the 25% percentile. Since the object already knows the smaller elements are in the left part, it can use a selection algorithm on the 3-element left part and extract the value -5 without even trying to sort the upper part of the array.
>>>>>>>
>>>>>>> What do you think?
>>>>>>>
>>>>>>> Luc
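Regarding the storeless estimator discussed in the thread, a single-pass sketch in the spirit of incremental quantile estimation could be as simple as the following (hypothetical class name, a plain stochastic-approximation update with a user-supplied gain scale; this is only an illustration of the storeless idea, not the algorithm of the paper cited above and not an existing commons-math class):

/**
 * Sketch: storeless (single-pass, O(1) memory) quantile estimator.
 * Each observation nudges the current estimate up or down so that,
 * over time, roughly a fraction p of the data falls below it.
 */
public class StorelessQuantileEstimator {

    private final double p;       // target quantile in (0, 1), e.g. 0.5 for the median
    private final double scale;   // gain scale; should roughly match the spread of the data
    private double estimate;
    private long n;

    public StorelessQuantileEstimator(double p, double scale) {
        this.p = p;
        this.scale = scale;
    }

    /** Incorporates one observation in O(1) time without storing it. */
    public void increment(double x) {
        n++;
        if (n == 1) {
            estimate = x;            // bootstrap with the first observation
            return;
        }
        double gain = scale / n;     // decreasing gain, Robbins-Monro style
        if (x > estimate) {
            estimate += gain * p;         // observation above the estimate: move up
        } else if (x < estimate) {
            estimate -= gain * (1 - p);   // observation below the estimate: move down
        }
    }

    /** Current estimate of the p-quantile of the data seen so far. */
    public double getResult() {
        return estimate;
    }
}

Something along these lines could presumably be wrapped as a StorelessUnivariateStatistic and aggregated by SummaryStatistics, which is the direction MATH-418 points to.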