Le 18/09/2010 16:16, Phil Steitz a écrit : > On 9/17/10 3:19 PM, Luc Maisonobe wrote: >> Le 17/09/2010 19:55, Ted Dunning a écrit : >>> There are also on-line percentile estimation methods that require only a >>> single pass over the data in order to get good estimates of several >>> quantile >>> at the same time. >>> >>> If you are interested in getting an estimate of some quantile of the >>> underlying distribution that generated your data then these on-line >>> methods >>> will give you an estimate that is very nearly as accurate as sorting >>> your >>> sample. Sorting gives you the exact quantiles of your sample, but >>> only an >>> estimate of the quantiles of your underlying distribution. >>> >>> There is a simplified implementation of this in Mahout along with >>> test cases >>> that demonstrate reasonable accuracy. >>> >>> These techniques are well described in the article: *Incremental >>> Quantile >>> Estimation for Massive >>> Tracking<http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580> >>> >>> * >>> * >>> * >>> *(available at >>> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580 )* >> >> Thanks for the pointer, it looks interesting. >> >>> * >>> * >>> *Regarding the incremental selection method to find multiple >>> quantiles, I >>> think that you save a little bit if you are looking for a few >>> quantiles, but >>> the added complexity and special cases are going to make testing >>> difficult. >>> Wouldn't it be better to use one of the following strategies:* >>> * >>> * >>> *- keep a copy of the sorted data and if that copy is available, just >>> use >>> it. This cuts the cost of 100 quantiles probed incrementally to a >>> single >>> sort. >> >> This is exactly what this user did: put the sort out of the loop. >> >>> Moreover, since sort is n log n without a radix sort, then the >>> increment selection algorithm can only win if 100< log n which is >>> pretty >>> unlikely.* >> >> I don't understand. Partition-based selection algorithms are basically >> partition-based sort algorithm where we recurse in only one of the >> partition once the pivot has been chosen. Subsequent calls therefore >> don't restart from an array of size n but from smaller sub-arrays, has >> the pivot can be saved (at least the top level ones). If at the end the >> number of selections is so high the array ends up to be completely >> sorted, the total cost is probably not much higher than what an initial >> sort would have done. It will be higher since their are some bookkeeping >> to do, but not so much higher I think. Doing one call corresponds to >> resuming the partial sort from the state resulting from previous calls. >> >> Did I miss something ? >> >> >>> * >>> * >>> *- add a method to probe multiple quantiles at the same time. This >>> potentially uses less memory than the first approach, but is >>> dependent on >>> the user calling the right method.* >> >> Since there are different use case, having several methods seems fair. A >> multiple quantiles method is a good idea. >> >>> * >>> * >>> *- use the on-line algorithm mentioned above with two passes, one to >>> estimate the quantiles of interest, a second to refine the estimate >>> using >>> the actual data. This allows the multiple probe method to be linear >>> in time >>> and should give you exact sample quantiles. It doesn't help the >>> repeated >>> probe problem.* >> >> I'm not sure I understand well. However, the on-line method by itself >> would be an interesting addition. It would allow quantiles computation >> with the Storeless versions of our classes. > > +1 - definitely valuable for the storeless case. > > I need to think about this some more / see some patches; but my initial > reaction is that we can get a big bang from just doing what the user did > (caching the sorted array and inserting into it when addValue is called) > for the data-in-memory impls. I would suggest implementing that for for > Percentile itself (taking care to handle addValue and rolling windows > correctly) and DescriptiveStatistics;
This would however handle only the multiple quantiles use case, not the case where only one quantile is computed that could be provided with a selection in O(n) rather than a sort in O(n ln(n)). Luc > but add a new (storelesss) > statistic based on an incremental quantile estimation algorithm and make > this accessible via the storeless aggregate, SummaryStatistics. It > might be interesting to include this optionally with > DescriptiveStatistics as well, so that in the rolling window case you > could compare the current window distribution to the full sample. > > Phil > > >> >> Luc >> >>> * >>> * >>> On Fri, Sep 17, 2010 at 10:34 AM, Luc >>> Maisonobe<luc.maison...@free.fr>wrote: >>> >>>> Hi all, >>>> >>>> During a recent face to face discussion with some commons-math users >>>> from a large project, it appeared one implementation choice for >>>> Percentile statistics was a huge performance bottleneck. >>>> >>>> When the Percentile.evaluate method is called, the sample array is >>>> copied and sorted. In one use case, 100 calls were made for each >>>> sample, >>>> so the same array was sorted 100 times. In another use case, only one >>>> call is made (typically to compute the median) but the user wondered >>>> why >>>> the array should be completely sorted. In fact, using a selection >>>> algorithm instead of a sorting algorithm would be sufficient. >>>> >>>> I would like to have you opinion about providing a different evaluate >>>> method that would process an array provided previously and use only >>>> selections to provide the percentile. Consider for example the >>>> following >>>> input array: >>>> >>>> [ -6, 8, 4, -5, 0, 2, 7 ] >>>> >>>> If we first as for the median, a selection algorithm may reorganize the >>>> array as: >>>> >>>> [ -6, 0, -5, 2, 8, 4, 7 ] >>>> >>>> were the left part is smaller than the central element, the right part >>>> is larger and hence the central element 2 is known to be the median. >>>> >>>> Then we ask for another value, the 25% percentile. Since the object >>>> already knows the smaller elements are in the left part, it can use a >>>> select algorithm on the 3 elements left part and extract the value -5 >>>> without even trying to sort the upper part of the array. >>>> >>>> What do you think ? >>>> >>>> Luc >>>> >>>> --------------------------------------------------------------------- >>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org >>>> For additional commands, e-mail: dev-h...@commons.apache.org >>>> >>>> >>> >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org >> For additional commands, e-mail: dev-h...@commons.apache.org >> > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > For additional commands, e-mail: dev-h...@commons.apache.org > --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org For additional commands, e-mail: dev-h...@commons.apache.org