I was about to write this suggestion down.

Instead, I will simply concur.

+1

The safe version can do the checksum or make a copy.  Either is safe and the
costs are similar (O(n) on every call and no memory or O(1) on every call
after the first, O(n) memory).

I would tend toward the copy option.

On Sun, Sep 26, 2010 at 8:39 AM, Mikkel Meyer Andersen <m...@mikl.dk> wrote:

> Hi,
>
> Why not simply make 3) the default, but also supply a
> *IMSureWhatIMDoing-method only implementing 2) (it is so cheap that we
> might as well do it instead of not doing any check at all). This means
> that users who read the documentation can gain something when they
> explicitly ask for it, while users who don't read the docs are still
> safe.
>
> Cheers, Mikkel.
>
> 2010/9/26 Luc Maisonobe <luc.maison...@free.fr>:
> > Le 23/09/2010 20:57, Luc Maisonobe a écrit :
> >> Hi,
> >>
> >> Resuming this thread, as I will be soon able to work on this (I hope).
> >> I propose to start by solving
> >> <https://issues.apache.org/jira/browse/MATH-417> using a simple
> >> partition-based selection algorithm and preserving pivot information.
> >> This would help both the single call use-case (we don't sort everything)
> >> and the multiple call case (we don't redo on subsequent calls work
> >> already done in the first calls.
> >
> > The current API of the Percentile method is derived from
> > AbstractUnivariateStatistic and provides several evaluate methods that
> > have the data array as a parameter.
> >
> > This is compatible with the one-call use case without any semantic
> > change, but does not fit the multiple-calls use case seamlessly.
> >
> > If we want to cache some work already done on the array between calls
> > with different values of p (the percentile value), we have to check if
> > the array is still the same as the cached one on which we have done some
> > partitioning and saved some pivots. This can be done by storing a
> > reference to the original array in addition to a copy of its content
> > (which we reorganize) and with an "if (values == originalValues)"
> > statement. This would however not protect against changes in the array
> > elements themselves done by the user between calls. A better protection
> > could be to compute a hash code of the array content and compare it to
> > the original one at each call. However, this would add a linear cost.
> > The user who encountered the performances problems added an addValues
> > method to set up the array once and simply ignored the values passed as
> > an argument to the evaluate methods. This breaks the semantics of the
> API.
> >
> > So I see four choices about partitioning caching and would like to have
> > people express which one they prefer. A clearCache or similar public
> > method would be made available to users to ensure clean computation if
> > the caching mechanism gets on their way.
> >
> >  1) do nothing to check the array is the same between calls and blindly
> >    assumes it IS the same. Users would really need to call clearCache
> >    when they provide a new array
> >    pros: very simple
> >    cons: error-prone for the user as it relies only on reading and
> >    understanding a documentation that would change with new version
> >
> >  2) check only array reference equality (with ==) to check the array
> >    is the same as the previous call or not. Users would need to call
> >    clearCache only if they use the same array but have changed its
> >    content.
> >    pros: trade-off between efficiency and reliability,
> >          handles most cases properly
> >    cons: may be wrong in corner cases
> >
> >  3) check array content using an hash code. Users would generally don't
> >    need to call clearCache at all so it would not be provided
> >    pros: works in all cases
> >    cons: add linear cost for array checking
> >
> >  4) remove the double[] values parameter from the API and use a separate
> >    addValues method, hence the class will reset its cache only when
> >    addValues is called
> >    pros: works in all cases
> >    cons: Percentile could not implement UnivariateStatistic anymore
> >
> > My preference is choice 2.
> >
> > What do you think ?
> >
> > Luc
> >
> >>
> >> Partition-based algorithms are O(n) in expected time but not in worst
> >> case time. If end users ask for it later, we could improve again by
> >> adding the Musser's introselect that guarantees O(n) behavior even in
> >> worst case.
> >>
> >> Another possibility would be to have an additional method that would
> >> compute percentiles for several thresholds at once, but this would
> >> require returning an array of results and I think it would be difficult
> >> to mix with other statistics (mainly the DescriptiveStatistics.apply
> >> method, which is used by the users I talked to).
> >>
> >> Would this meet everyone concerns ?
> >> Luc
> >>
> >>
> >> Le 18/09/2010 21:50, Dimitri Pourbaix a écrit :
> >>> Ted,
> >>>
> >>>> O(n), not n
> >>>>
> >>>> Expected case is n + n/2 + n/4 ... < 2n
> >>>
> >>> Yes, that I am already more incline to buy.
> >>>
> >>> Dim.
> >>>
> ----------------------------------------------------------------------------
> >>>
> >>> Dimitri Pourbaix                         *
> >>> Institut d'Astronomie et d'Astrophysique *      Don't worry, be happy
> >>> CP 226, office 2.N4.211, building NO     *         and CARPE DIEM.
> >>> Universite Libre de Bruxelles            *
> >>> Boulevard du Triomphe                    *      Tel : +32-2-650.35.71
> >>>  B-1050 Bruxelles                        *      Fax : +32-2-650.42.26
> >>> http://sb9.astro.ulb.ac.be/~pourbaix     * mailto:
> pourb...@astro.ulb.ac.be
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> >>> For additional commands, e-mail: dev-h...@commons.apache.org
> >>>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> >> For additional commands, e-mail: dev-h...@commons.apache.org
> >>
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> > For additional commands, e-mail: dev-h...@commons.apache.org
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
>
>

Reply via email to