On 18/09/2010 17:39, Luc Maisonobe wrote:
> On 18/09/2010 17:36, Phil Steitz wrote:
>> On 9/18/10 10:24 AM, Luc Maisonobe wrote:
>>> On 18/09/2010 16:16, Phil Steitz wrote:
>>>> On 9/17/10 3:19 PM, Luc Maisonobe wrote:
>>>>> On 17/09/2010 19:55, Ted Dunning wrote:
>>>>>> There are also on-line percentile estimation methods that require only a single pass over the data in order to get good estimates of several quantiles at the same time.
>>>>>>
>>>>>> If you are interested in getting an estimate of some quantile of the underlying distribution that generated your data, then these on-line methods will give you an estimate that is very nearly as accurate as sorting your sample. Sorting gives you the exact quantiles of your sample, but only an estimate of the quantiles of your underlying distribution.
>>>>>>
>>>>>> There is a simplified implementation of this in Mahout along with test cases that demonstrate reasonable accuracy.
>>>>>>
>>>>>> These techniques are well described in the article "Incremental Quantile Estimation for Massive Tracking" (available at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580 ).
>>>>>
>>>>> Thanks for the pointer, it looks interesting.
>>>>>
>>>>>> Regarding the incremental selection method to find multiple quantiles, I think that you save a little bit if you are looking for a few quantiles, but the added complexity and special cases are going to make testing difficult. Wouldn't it be better to use one of the following strategies:
>>>>>>
>>>>>> - keep a copy of the sorted data and, if that copy is available, just use it. This cuts the cost of 100 quantiles probed incrementally to a single sort.
>>>>>
>>>>> This is exactly what this user did: put the sort out of the loop.
>>>>>
>>>>>> Moreover, since sort is n log n without a radix sort, the incremental selection algorithm can only win if 100 < log n, which is pretty unlikely.
>>>>>
>>>>> I don't understand. Partition-based selection algorithms are basically partition-based sort algorithms where we recurse into only one of the partitions once the pivot has been chosen. Subsequent calls therefore don't restart from an array of size n but from smaller sub-arrays, as the pivots can be saved (at least the top-level ones). If in the end the number of selections is so high that the array ends up completely sorted, the total cost is probably not much higher than what an initial sort would have done. It will be higher, since there is some bookkeeping to do, but not much higher I think. Each call corresponds to resuming the partial sort from the state left by the previous calls.
>>>>>
>>>>> Did I miss something?
>>>>>
>>>>>> - add a method to probe multiple quantiles at the same time. This potentially uses less memory than the first approach, but is dependent on the user calling the right method.
>>>>>
>>>>> Since there are different use cases, having several methods seems fair. A multiple quantiles method is a good idea.
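As a rough sketch of the pivot-caching selection described above (hypothetical class and method names, not the commons-math API, and not meant as a definitive implementation), the idea of resuming a partial sort across calls could look like this in Java:

import java.util.TreeSet;

/**
 * Sketch: selection with cached pivots. Each select(k) call only partitions
 * the smallest segment between two pivots already known to be in their final
 * sorted position, so repeated quantile queries resume the partial sort
 * instead of restarting from the full array.
 */
public class CachedSelection {

    private final double[] data;
    // Indices whose elements are already in their final (sorted) position.
    private final TreeSet<Integer> pivots = new TreeSet<Integer>();

    public CachedSelection(double[] values) {
        this.data = values.clone();
        pivots.add(-1);            // sentinel below the array
        pivots.add(values.length); // sentinel above the array
    }

    /** Returns the element that would sit at index k if the array were fully sorted. */
    public double select(int k) {
        int lo = pivots.floor(k) + 1;   // start of the enclosing unsorted segment
        int hi = pivots.ceiling(k) - 1; // end of the enclosing unsorted segment
        while (lo < hi) {
            int p = partition(lo, hi);
            pivots.add(p);              // remember this pivot for later calls
            if (p == k) {
                return data[k];
            } else if (p < k) {
                lo = p + 1;
            } else {
                hi = p - 1;
            }
        }
        return data[k];
    }

    /** Lomuto partition of [lo, hi] around its last element; returns the pivot's final index. */
    private int partition(int lo, int hi) {
        double pivot = data[hi];
        int i = lo;
        for (int j = lo; j < hi; j++) {
            if (data[j] < pivot) {
                swap(i++, j);
            }
        }
        swap(i, hi);
        return i;
    }

    private void swap(int i, int j) {
        double t = data[i];
        data[i] = data[j];
        data[j] = t;
    }
}

On the example array used later in this thread, [ -6, 8, 4, -5, 0, 2, 7 ], select(3) returns the median 2, and a subsequent select(1) only partitions the small left segment to return -5.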
>>>>>>
>>>>>> - use the on-line algorithm mentioned above with two passes, one to estimate the quantiles of interest, and a second to refine the estimate using the actual data. This allows the multiple probe method to be linear in time and should give you exact sample quantiles. It doesn't help the repeated probe problem.
>>>>>
>>>>> I'm not sure I understand this well. However, the on-line method by itself would be an interesting addition. It would allow quantile computation with the storeless versions of our classes.
>>>>
>>>> +1 - definitely valuable for the storeless case.
>>>>
>>>> I need to think about this some more / see some patches; but my initial reaction is that we can get a big bang from just doing what the user did (caching the sorted array and inserting into it when addValue is called) for the data-in-memory impls. I would suggest implementing that for Percentile itself (taking care to handle addValue and rolling windows correctly) and DescriptiveStatistics;
>>>
>>> This would however handle only the multiple quantiles use case, not the case where only one quantile is computed, which could be provided with a selection in O(n) rather than a sort in O(n ln(n)).
>>>
>> Correct. Ideal would be to do the sort incrementally as you describe for the data-in-memory impl. It would be tricky to implement in Percentile, given the array index parameters and the need to handle addValue and the window in DescriptiveStatistics, but this should be doable.
>>
>> Let's open 2 JIRAs for this - one straight performance improvement for Percentile and the other for a storeless percentile estimator.
>
> OK. I'll open them in a few minutes.
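For the data-in-memory case, a minimal sketch of the "cache the sorted array and insert on addValue" idea discussed above might look like the following (hypothetical class, nearest-rank percentile definition for brevity, no rolling-window handling, and not the actual Percentile/DescriptiveStatistics API):

import java.util.Arrays;

/**
 * Sketch: keep a sorted copy of the sample and maintain it by binary
 * insertion on each addValue, so repeated percentile queries no longer
 * pay an O(n log n) sort each time.
 */
public class SortedSamplePercentile {

    private double[] sorted = new double[0];

    /** Inserts a value into the sorted cache: O(log n) search plus O(n) shift. */
    public void addValue(double v) {
        int pos = Arrays.binarySearch(sorted, v);
        if (pos < 0) {
            pos = -pos - 1; // insertion point when the value is not already present
        }
        double[] next = new double[sorted.length + 1];
        System.arraycopy(sorted, 0, next, 0, pos);
        next[pos] = v;
        System.arraycopy(sorted, pos, next, pos + 1, sorted.length - pos);
        sorted = next;
    }

    /** Nearest-rank percentile read directly from the cached sorted array. */
    public double getPercentile(double p) {
        if (sorted.length == 0) {
            throw new IllegalStateException("no data");
        }
        int index = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(0, Math.min(index, sorted.length - 1))];
    }
}

Each query is then O(1) once the cache is built; the trade-off, as noted above, is that this only helps the repeated-probe case, not a single percentile computed on a fresh array.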
The issues are <https://issues.apache.org/jira/browse/MATH-417> and <https://issues.apache.org/jira/browse/MATH-418>. I have set the target version to 2.2 for the first one and to 3.0 for the second one. I'll take a look at the first one after finishing my current work (implementing some new random generators).

Luc

>
> Luc
>
>>
>> Phil
>>
>>> Luc
>>>
>>>> but add a new (storeless) statistic based on an incremental quantile estimation algorithm and make this accessible via the storeless aggregate, SummaryStatistics. It might be interesting to include this optionally with DescriptiveStatistics as well, so that in the rolling window case you could compare the current window distribution to the full sample.
>>>>
>>>> Phil
>>>>
>>>>> Luc
>>>>>
>>>>>> On Fri, Sep 17, 2010 at 10:34 AM, Luc Maisonobe <luc.maison...@free.fr> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> During a recent face-to-face discussion with some commons-math users from a large project, it appeared that one implementation choice for the Percentile statistic was a huge performance bottleneck.
>>>>>>>
>>>>>>> When the Percentile.evaluate method is called, the sample array is copied and sorted. In one use case, 100 calls were made for each sample, so the same array was sorted 100 times. In another use case, only one call is made (typically to compute the median), but the user wondered why the array should be completely sorted. In fact, using a selection algorithm instead of a sorting algorithm would be sufficient.
>>>>>>>
>>>>>>> I would like to have your opinion about providing a different evaluate method that would process an array provided previously and use only selections to provide the percentile. Consider for example the following input array:
>>>>>>>
>>>>>>> [ -6, 8, 4, -5, 0, 2, 7 ]
>>>>>>>
>>>>>>> If we first ask for the median, a selection algorithm may reorganize the array as:
>>>>>>>
>>>>>>> [ -6, 0, -5, 2, 8, 4, 7 ]
>>>>>>>
>>>>>>> where the left part is smaller than the central element, the right part is larger, and hence the central element 2 is known to be the median.
>>>>>>>
>>>>>>> Then we ask for another value, the 25% percentile. Since the object already knows the smaller elements are in the left part, it can use a selection algorithm on the 3-element left part and extract the value -5 without even trying to sort the upper part of the array.
>>>>>>>
>>>>>>> What do you think?
>>>>>>>
>>>>>>> Luc
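Regarding the storeless estimator discussed in the thread, a single-pass sketch in the spirit of incremental quantile estimation could be as simple as the following (hypothetical class name, a plain stochastic-approximation update with a user-supplied gain scale; this is only an illustration of the storeless idea, not the algorithm of the paper cited above and not an existing commons-math class):

/**
 * Sketch: storeless (single-pass, O(1) memory) quantile estimator.
 * Each observation nudges the current estimate up or down so that,
 * over time, roughly a fraction p of the data falls below it.
 */
public class StorelessQuantileEstimator {

    private final double p;       // target quantile in (0, 1), e.g. 0.5 for the median
    private final double scale;   // gain scale; should roughly match the spread of the data
    private double estimate;
    private long n;

    public StorelessQuantileEstimator(double p, double scale) {
        this.p = p;
        this.scale = scale;
    }

    /** Incorporates one observation in O(1) time without storing it. */
    public void increment(double x) {
        n++;
        if (n == 1) {
            estimate = x;            // bootstrap with the first observation
            return;
        }
        double gain = scale / n;     // decreasing gain, Robbins-Monro style
        if (x > estimate) {
            estimate += gain * p;         // observation above the estimate: move up
        } else if (x < estimate) {
            estimate -= gain * (1 - p);   // observation below the estimate: move down
        }
    }

    /** Current estimate of the p-quantile of the data seen so far. */
    public double getResult() {
        return estimate;
    }
}

Something along these lines could presumably be wrapped as a StorelessUnivariateStatistic and aggregated by SummaryStatistics, which is the direction MATH-418 points to.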