Re: [math] speeding up percentile based statistics

Luc Maisonobe Sun, 26 Sep 2010 09:52:02 -0700

Le 26/09/2010 18:21, Ted Dunning a écrit :
> I was about to write this suggestion down.
> 
> Instead, I will simply concur.
> 
> +1
> 
> The safe version can do the checksum or make a copy.  Either is safe and the
> costs are similar (O(n) on every call and no memory or O(1) on every call
> after the first, O(n) memory).


OK. Then I will have a copy of the data in the class and for each
evaluate method with a double[] values parameter there will also be a
method without this parameter that will run on the already provided
array. The implementations with the value parameter will have an
implementation roughly similar to:

public double evaluate(final double[] values, final double p) {
  checkCache(values);
  return evaluate(p);
}

private void checkCache(final double[] values) {
  final int hash = Arrays.hashcode(values);
  if ((values != originalValues) || (hash != originalHash)) {
     clearCache();
     originalValues = values;
     originalHash   = hash;
     copiedValues   = values.clone();
  }
}

Starting to work on it ...

Luc

> 
> I would tend toward the copy option.
> 
> On Sun, Sep 26, 2010 at 8:39 AM, Mikkel Meyer Andersen <[email protected]> wrote:
> 
>> Hi,
>>
>> Why not simply make 3) the default, but also supply a
>> *IMSureWhatIMDoing-method only implementing 2) (it is so cheap that we
>> might as well do it instead of not doing any check at all). This means
>> that users who read the documentation can gain something when they
>> explicitly ask for it, while users who don't read the docs are still
>> safe.
>>
>> Cheers, Mikkel.
>>
>> 2010/9/26 Luc Maisonobe <[email protected]>:
>>> Le 23/09/2010 20:57, Luc Maisonobe a écrit :
>>>> Hi,
>>>>
>>>> Resuming this thread, as I will be soon able to work on this (I hope).
>>>> I propose to start by solving
>>>> <https://issues.apache.org/jira/browse/MATH-417> using a simple
>>>> partition-based selection algorithm and preserving pivot information.
>>>> This would help both the single call use-case (we don't sort everything)
>>>> and the multiple call case (we don't redo on subsequent calls work
>>>> already done in the first calls.
>>>
>>> The current API of the Percentile method is derived from
>>> AbstractUnivariateStatistic and provides several evaluate methods that
>>> have the data array as a parameter.
>>>
>>> This is compatible with the one-call use case without any semantic
>>> change, but does not fit the multiple-calls use case seamlessly.
>>>
>>> If we want to cache some work already done on the array between calls
>>> with different values of p (the percentile value), we have to check if
>>> the array is still the same as the cached one on which we have done some
>>> partitioning and saved some pivots. This can be done by storing a
>>> reference to the original array in addition to a copy of its content
>>> (which we reorganize) and with an "if (values == originalValues)"
>>> statement. This would however not protect against changes in the array
>>> elements themselves done by the user between calls. A better protection
>>> could be to compute a hash code of the array content and compare it to
>>> the original one at each call. However, this would add a linear cost.
>>> The user who encountered the performances problems added an addValues
>>> method to set up the array once and simply ignored the values passed as
>>> an argument to the evaluate methods. This breaks the semantics of the
>> API.
>>>
>>> So I see four choices about partitioning caching and would like to have
>>> people express which one they prefer. A clearCache or similar public
>>> method would be made available to users to ensure clean computation if
>>> the caching mechanism gets on their way.
>>>
>>>  1) do nothing to check the array is the same between calls and blindly
>>>    assumes it IS the same. Users would really need to call clearCache
>>>    when they provide a new array
>>>    pros: very simple
>>>    cons: error-prone for the user as it relies only on reading and
>>>    understanding a documentation that would change with new version
>>>
>>>  2) check only array reference equality (with ==) to check the array
>>>    is the same as the previous call or not. Users would need to call
>>>    clearCache only if they use the same array but have changed its
>>>    content.
>>>    pros: trade-off between efficiency and reliability,
>>>          handles most cases properly
>>>    cons: may be wrong in corner cases
>>>
>>>  3) check array content using an hash code. Users would generally don't
>>>    need to call clearCache at all so it would not be provided
>>>    pros: works in all cases
>>>    cons: add linear cost for array checking
>>>
>>>  4) remove the double[] values parameter from the API and use a separate
>>>    addValues method, hence the class will reset its cache only when
>>>    addValues is called
>>>    pros: works in all cases
>>>    cons: Percentile could not implement UnivariateStatistic anymore
>>>
>>> My preference is choice 2.
>>>
>>> What do you think ?
>>>
>>> Luc
>>>
>>>>
>>>> Partition-based algorithms are O(n) in expected time but not in worst
>>>> case time. If end users ask for it later, we could improve again by
>>>> adding the Musser's introselect that guarantees O(n) behavior even in
>>>> worst case.
>>>>
>>>> Another possibility would be to have an additional method that would
>>>> compute percentiles for several thresholds at once, but this would
>>>> require returning an array of results and I think it would be difficult
>>>> to mix with other statistics (mainly the DescriptiveStatistics.apply
>>>> method, which is used by the users I talked to).
>>>>
>>>> Would this meet everyone concerns ?
>>>> Luc
>>>>
>>>>
>>>> Le 18/09/2010 21:50, Dimitri Pourbaix a écrit :
>>>>> Ted,
>>>>>
>>>>>> O(n), not n
>>>>>>
>>>>>> Expected case is n + n/2 + n/4 ... < 2n
>>>>>
>>>>> Yes, that I am already more incline to buy.
>>>>>
>>>>> Dim.
>>>>>
>> ----------------------------------------------------------------------------
>>>>>
>>>>> Dimitri Pourbaix                         *
>>>>> Institut d'Astronomie et d'Astrophysique *      Don't worry, be happy
>>>>> CP 226, office 2.N4.211, building NO     *         and CARPE DIEM.
>>>>> Universite Libre de Bruxelles            *
>>>>> Boulevard du Triomphe                    *      Tel : +32-2-650.35.71
>>>>>  B-1050 Bruxelles                        *      Fax : +32-2-650.42.26
>>>>> http://sb9.astro.ulb.ac.be/~pourbaix     * mailto:
>> [email protected]
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [email protected]
>>>>> For additional commands, e-mail: [email protected]
>>>>>
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]
>>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [math] speeding up percentile based statistics

Reply via email to