Le 18/09/2010 17:36, Phil Steitz a écrit : > On 9/18/10 10:24 AM, Luc Maisonobe wrote: >> Le 18/09/2010 16:16, Phil Steitz a écrit : >>> On 9/17/10 3:19 PM, Luc Maisonobe wrote: >>>> Le 17/09/2010 19:55, Ted Dunning a écrit : >>>>> There are also on-line percentile estimation methods that require >>>>> only a >>>>> single pass over the data in order to get good estimates of several >>>>> quantile >>>>> at the same time. >>>>> >>>>> If you are interested in getting an estimate of some quantile of the >>>>> underlying distribution that generated your data then these on-line >>>>> methods >>>>> will give you an estimate that is very nearly as accurate as sorting >>>>> your >>>>> sample. Sorting gives you the exact quantiles of your sample, but >>>>> only an >>>>> estimate of the quantiles of your underlying distribution. >>>>> >>>>> There is a simplified implementation of this in Mahout along with >>>>> test cases >>>>> that demonstrate reasonable accuracy. >>>>> >>>>> These techniques are well described in the article: *Incremental >>>>> Quantile >>>>> Estimation for Massive >>>>> Tracking<http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580> >>>>> >>>>> >>>>> * >>>>> * >>>>> * >>>>> *(available at >>>>> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580 )* >>>> >>>> Thanks for the pointer, it looks interesting. >>>> >>>>> * >>>>> * >>>>> *Regarding the incremental selection method to find multiple >>>>> quantiles, I >>>>> think that you save a little bit if you are looking for a few >>>>> quantiles, but >>>>> the added complexity and special cases are going to make testing >>>>> difficult. >>>>> Wouldn't it be better to use one of the following strategies:* >>>>> * >>>>> * >>>>> *- keep a copy of the sorted data and if that copy is available, just >>>>> use >>>>> it. This cuts the cost of 100 quantiles probed incrementally to a >>>>> single >>>>> sort. >>>> >>>> This is exactly what this user did: put the sort out of the loop. >>>> >>>>> Moreover, since sort is n log n without a radix sort, then the >>>>> increment selection algorithm can only win if 100< log n which is >>>>> pretty >>>>> unlikely.* >>>> >>>> I don't understand. Partition-based selection algorithms are basically >>>> partition-based sort algorithm where we recurse in only one of the >>>> partition once the pivot has been chosen. Subsequent calls therefore >>>> don't restart from an array of size n but from smaller sub-arrays, has >>>> the pivot can be saved (at least the top level ones). If at the end the >>>> number of selections is so high the array ends up to be completely >>>> sorted, the total cost is probably not much higher than what an initial >>>> sort would have done. It will be higher since their are some >>>> bookkeeping >>>> to do, but not so much higher I think. Doing one call corresponds to >>>> resuming the partial sort from the state resulting from previous calls. >>>> >>>> Did I miss something ? >>>> >>>> >>>>> * >>>>> * >>>>> *- add a method to probe multiple quantiles at the same time. This >>>>> potentially uses less memory than the first approach, but is >>>>> dependent on >>>>> the user calling the right method.* >>>> >>>> Since there are different use case, having several methods seems >>>> fair. A >>>> multiple quantiles method is a good idea. >>>> >>>>> * >>>>> * >>>>> *- use the on-line algorithm mentioned above with two passes, one to >>>>> estimate the quantiles of interest, a second to refine the estimate >>>>> using >>>>> the actual data. This allows the multiple probe method to be linear >>>>> in time >>>>> and should give you exact sample quantiles. It doesn't help the >>>>> repeated >>>>> probe problem.* >>>> >>>> I'm not sure I understand well. However, the on-line method by itself >>>> would be an interesting addition. It would allow quantiles computation >>>> with the Storeless versions of our classes. >>> >>> +1 - definitely valuable for the storeless case. >>> >>> I need to think about this some more / see some patches; but my initial >>> reaction is that we can get a big bang from just doing what the user did >>> (caching the sorted array and inserting into it when addValue is called) >>> for the data-in-memory impls. I would suggest implementing that for for >>> Percentile itself (taking care to handle addValue and rolling windows >>> correctly) and DescriptiveStatistics; >> >> This would however handle only the multiple quantiles use case, not the >> case where only one quantile is computed that could be provided with a >> selection in O(n) rather than a sort in O(n ln(n)). >> > > Correct. Ideal would be to do the sort incrementally as you describe > for the data-in-memory impl. Would be tricky to implement in > Percentile, given array index parameters and the need to handle addValue > and window in DescriptiveStatistics; but this should be doable. > > Lets open 2 JIRAs for this - one straight performance improvement for > Percentile and the other for a storeless percentile estimator.
OK. I'll open them in a few minutes. Luc > > Phil > > >> Luc >> >>> but add a new (storelesss) >>> statistic based on an incremental quantile estimation algorithm and make >>> this accessible via the storeless aggregate, SummaryStatistics. It >>> might be interesting to include this optionally with >>> DescriptiveStatistics as well, so that in the rolling window case you >>> could compare the current window distribution to the full sample. >>> >>> Phil >>> >>> >>>> >>>> Luc >>>> >>>>> * >>>>> * >>>>> On Fri, Sep 17, 2010 at 10:34 AM, Luc >>>>> Maisonobe<luc.maison...@free.fr>wrote: >>>>> >>>>>> Hi all, >>>>>> >>>>>> During a recent face to face discussion with some commons-math users >>>>>> from a large project, it appeared one implementation choice for >>>>>> Percentile statistics was a huge performance bottleneck. >>>>>> >>>>>> When the Percentile.evaluate method is called, the sample array is >>>>>> copied and sorted. In one use case, 100 calls were made for each >>>>>> sample, >>>>>> so the same array was sorted 100 times. In another use case, only one >>>>>> call is made (typically to compute the median) but the user wondered >>>>>> why >>>>>> the array should be completely sorted. In fact, using a selection >>>>>> algorithm instead of a sorting algorithm would be sufficient. >>>>>> >>>>>> I would like to have you opinion about providing a different evaluate >>>>>> method that would process an array provided previously and use only >>>>>> selections to provide the percentile. Consider for example the >>>>>> following >>>>>> input array: >>>>>> >>>>>> [ -6, 8, 4, -5, 0, 2, 7 ] >>>>>> >>>>>> If we first as for the median, a selection algorithm may >>>>>> reorganize the >>>>>> array as: >>>>>> >>>>>> [ -6, 0, -5, 2, 8, 4, 7 ] >>>>>> >>>>>> were the left part is smaller than the central element, the right >>>>>> part >>>>>> is larger and hence the central element 2 is known to be the median. >>>>>> >>>>>> Then we ask for another value, the 25% percentile. Since the object >>>>>> already knows the smaller elements are in the left part, it can use a >>>>>> select algorithm on the 3 elements left part and extract the value -5 >>>>>> without even trying to sort the upper part of the array. >>>>>> >>>>>> What do you think ? >>>>>> >>>>>> Luc >>>>>> >>>>>> --------------------------------------------------------------------- >>>>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org >>>>>> For additional commands, e-mail: dev-h...@commons.apache.org >>>>>> >>>>>> >>>>> >>>> >>>> >>>> --------------------------------------------------------------------- >>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org >>>> For additional commands, e-mail: dev-h...@commons.apache.org >>>> >>> >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org >>> For additional commands, e-mail: dev-h...@commons.apache.org >>> >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org >> For additional commands, e-mail: dev-h...@commons.apache.org >> > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > For additional commands, e-mail: dev-h...@commons.apache.org > --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org For additional commands, e-mail: dev-h...@commons.apache.org