I was about to write this suggestion down. Instead, I will simply concur.
+1 The safe version can do the checksum or make a copy. Either is safe and the costs are similar (O(n) on every call and no memory or O(1) on every call after the first, O(n) memory). I would tend toward the copy option. On Sun, Sep 26, 2010 at 8:39 AM, Mikkel Meyer Andersen <m...@mikl.dk> wrote: > Hi, > > Why not simply make 3) the default, but also supply a > *IMSureWhatIMDoing-method only implementing 2) (it is so cheap that we > might as well do it instead of not doing any check at all). This means > that users who read the documentation can gain something when they > explicitly ask for it, while users who don't read the docs are still > safe. > > Cheers, Mikkel. > > 2010/9/26 Luc Maisonobe <luc.maison...@free.fr>: > > Le 23/09/2010 20:57, Luc Maisonobe a écrit : > >> Hi, > >> > >> Resuming this thread, as I will be soon able to work on this (I hope). > >> I propose to start by solving > >> <https://issues.apache.org/jira/browse/MATH-417> using a simple > >> partition-based selection algorithm and preserving pivot information. > >> This would help both the single call use-case (we don't sort everything) > >> and the multiple call case (we don't redo on subsequent calls work > >> already done in the first calls. > > > > The current API of the Percentile method is derived from > > AbstractUnivariateStatistic and provides several evaluate methods that > > have the data array as a parameter. > > > > This is compatible with the one-call use case without any semantic > > change, but does not fit the multiple-calls use case seamlessly. > > > > If we want to cache some work already done on the array between calls > > with different values of p (the percentile value), we have to check if > > the array is still the same as the cached one on which we have done some > > partitioning and saved some pivots. This can be done by storing a > > reference to the original array in addition to a copy of its content > > (which we reorganize) and with an "if (values == originalValues)" > > statement. This would however not protect against changes in the array > > elements themselves done by the user between calls. A better protection > > could be to compute a hash code of the array content and compare it to > > the original one at each call. However, this would add a linear cost. > > The user who encountered the performances problems added an addValues > > method to set up the array once and simply ignored the values passed as > > an argument to the evaluate methods. This breaks the semantics of the > API. > > > > So I see four choices about partitioning caching and would like to have > > people express which one they prefer. A clearCache or similar public > > method would be made available to users to ensure clean computation if > > the caching mechanism gets on their way. > > > > 1) do nothing to check the array is the same between calls and blindly > > assumes it IS the same. Users would really need to call clearCache > > when they provide a new array > > pros: very simple > > cons: error-prone for the user as it relies only on reading and > > understanding a documentation that would change with new version > > > > 2) check only array reference equality (with ==) to check the array > > is the same as the previous call or not. Users would need to call > > clearCache only if they use the same array but have changed its > > content. > > pros: trade-off between efficiency and reliability, > > handles most cases properly > > cons: may be wrong in corner cases > > > > 3) check array content using an hash code. Users would generally don't > > need to call clearCache at all so it would not be provided > > pros: works in all cases > > cons: add linear cost for array checking > > > > 4) remove the double[] values parameter from the API and use a separate > > addValues method, hence the class will reset its cache only when > > addValues is called > > pros: works in all cases > > cons: Percentile could not implement UnivariateStatistic anymore > > > > My preference is choice 2. > > > > What do you think ? > > > > Luc > > > >> > >> Partition-based algorithms are O(n) in expected time but not in worst > >> case time. If end users ask for it later, we could improve again by > >> adding the Musser's introselect that guarantees O(n) behavior even in > >> worst case. > >> > >> Another possibility would be to have an additional method that would > >> compute percentiles for several thresholds at once, but this would > >> require returning an array of results and I think it would be difficult > >> to mix with other statistics (mainly the DescriptiveStatistics.apply > >> method, which is used by the users I talked to). > >> > >> Would this meet everyone concerns ? > >> Luc > >> > >> > >> Le 18/09/2010 21:50, Dimitri Pourbaix a écrit : > >>> Ted, > >>> > >>>> O(n), not n > >>>> > >>>> Expected case is n + n/2 + n/4 ... < 2n > >>> > >>> Yes, that I am already more incline to buy. > >>> > >>> Dim. > >>> > ---------------------------------------------------------------------------- > >>> > >>> Dimitri Pourbaix * > >>> Institut d'Astronomie et d'Astrophysique * Don't worry, be happy > >>> CP 226, office 2.N4.211, building NO * and CARPE DIEM. > >>> Universite Libre de Bruxelles * > >>> Boulevard du Triomphe * Tel : +32-2-650.35.71 > >>> B-1050 Bruxelles * Fax : +32-2-650.42.26 > >>> http://sb9.astro.ulb.ac.be/~pourbaix * mailto: > pourb...@astro.ulb.ac.be > >>> > >>> --------------------------------------------------------------------- > >>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > >>> For additional commands, e-mail: dev-h...@commons.apache.org > >>> > >> > >> > >> --------------------------------------------------------------------- > >> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > >> For additional commands, e-mail: dev-h...@commons.apache.org > >> > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > > For additional commands, e-mail: dev-h...@commons.apache.org > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > For additional commands, e-mail: dev-h...@commons.apache.org > >