On 9/26/10 12:51 PM, Luc Maisonobe wrote:
On 26/09/2010 18:21, Ted Dunning wrote:
I was about to write this suggestion down.
Instead, I will simply concur.
+1
The safe version can do the checksum or make a copy. Either is safe, and the
costs are similar (O(n) time on every call with no extra memory, or O(1)
time on every call after the first with O(n) memory).
OK. Then I will keep a copy of the data in the class, and for each
evaluate method with a double[] values parameter there will also be a
method without this parameter that runs on the already provided
array. The methods with the values parameter will have an
implementation roughly similar to:
public double evaluate(final double[] values, final double p) {
    checkCache(values);
    return evaluate(p);
}

private void checkCache(final double[] values) {
    // Invalidate the cache when the caller passes a different array,
    // or the same array whose content hash has changed.
    final int hash = Arrays.hashCode(values);
    if ((values != originalValues) || (hash != originalHash)) {
        clearCache();
        originalValues = values;
        originalHash = hash;
        copiedValues = values.clone();
    }
}
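From the caller's side the intent would be something like this (sketch
only; the caching stays invisible to the user):

    double[] data = {1.0, 5.0, 2.0, 8.0, 3.0};
    Percentile percentile = new Percentile();
    // First call: cache miss, the array is copied and partially partitioned.
    double q25 = percentile.evaluate(data, 25.0);
    // Same reference, unchanged content: cache hit, earlier work is reused.
    double q75 = percentile.evaluate(data, 75.0);
    // Mutating the content changes the hash, so the next call rebuilds the cache.
    data[0] = 42.0;
    double median = percentile.evaluate(data, 50.0);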
Starting to work on it ...
-0 on the approach. Be careful with
1) array offset arguments. The contract of evaluate(p) will have to
be evaluation on the same segment of the same array used in the last
call.
2) For DescriptiveStatistics, the cache will have to be cleared when
values are added between calls, since getPercentile(p) is already
defined for that class and needs to return the value based on the
current dataset.
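For point 2, the invalidation would have to look roughly like this
(sketch only: percentileImpl and eDA are existing DescriptiveStatistics
fields, clearCache() is the hypothetical hook from this proposal, and
the real addValue also handles the window size):

    public void addValue(final double v) {
        eDA.addElement(v);            // dataset changed ...
        percentileImpl.clearCache();  // ... so cached partitioning is stale
    }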
The -0 as opposed to +1 is because
a) handling offsets is going to make checkCache more complicated
b) while the probability of collision may be very low in practice,
there is no guarantee that hash collisions won't occur (in fact, it
is certain that they *can* occur) and the result could be *nasty*,
especially given that we are not telling the user we are doing it.
c) users who pass copies of arrays containing the same values on
subsequent calls will not get the benefit of the caching.
It may be better to just add setData(double[]), setData(double[],
start, end) and evaluate(p) (and evaluate()) methods, and define the
contract of evaluate(p) to refer to the data added in the setData
methods. We could talk about extending AbstractUnivariateStatistic
the same way, adding setData and evaluate methods there too - possibly
evaluate(start, end) as well.
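A minimal sketch of what that contract could look like (method names
taken from the paragraph above, everything else illustrative):

    /** Sketch of the setData-based contract; names are illustrative. */
    public abstract class CachedUnivariateStatistic {

        private double[] data;

        /** Stores a defensive copy of values and clears any cached work. */
        public void setData(double[] values) {
            this.data = values.clone();
            clearCache();
        }

        /** Stores a copy of values[start..end) and clears any cached work. */
        public void setData(double[] values, int start, int end) {
            this.data = new double[end - start];
            System.arraycopy(values, start, this.data, 0, end - start);
            clearCache();
        }

        /** Evaluates on the data provided by the last setData call. */
        public abstract double evaluate(double p);

        protected double[] getData() {
            return data;
        }

        protected abstract void clearCache();
    }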
Phil
Luc
I would tend toward the copy option.
On Sun, Sep 26, 2010 at 8:39 AM, Mikkel Meyer Andersen <m...@mikl.dk> wrote:
Hi,
Why not simply make 3) the default, but also supply an
*ImSureWhatImDoing* method that only implements 2) (it is so cheap that
we might as well do it instead of doing no check at all). This means
that users who read the documentation can gain something when they
explicitly ask for it, while users who don't read the docs are still
safe.
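Something like the following split, with illustrative names
(evaluateUnchecked stands in for the *ImSureWhatImDoing* method, and
the bodies elided below are where the partitioning work would live):

    import java.util.Arrays;

    public class Percentile {

        private double[] copiedValues;   // defensive copy we partition
        private double[] originalValues; // reference to the caller's array
        private int originalHash;        // content hash at caching time

        /** Safe default, option 3: O(n) content check, always correct. */
        public double evaluate(final double[] values, final double p) {
            final int hash = Arrays.hashCode(values);
            if (values != originalValues || hash != originalHash) {
                rebuildCache(values, hash);
            }
            return evaluate(p);
        }

        /** Opt-in fast path, option 2: O(1) reference check only; the
         *  caller promises not to mutate the array between calls. */
        public double evaluateUnchecked(final double[] values, final double p) {
            if (values != originalValues) {
                rebuildCache(values, Arrays.hashCode(values));
            }
            return evaluate(p);
        }

        private void rebuildCache(final double[] values, final int hash) {
            originalValues = values;
            originalHash = hash;
            copiedValues = values.clone();
            // ... also discard any saved pivot information here
        }

        public double evaluate(final double p) {
            // ... partition-based selection on copiedValues, elided
            return Double.NaN; // placeholder
        }
    }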
Cheers, Mikkel.
2010/9/26 Luc Maisonobe <luc.maison...@free.fr>:
On 23/09/2010 20:57, Luc Maisonobe wrote:
Hi,
Resuming this thread, as I will soon be able to work on this (I hope).
I propose to start by solving
<https://issues.apache.org/jira/browse/MATH-417> using a simple
partition-based selection algorithm and preserving pivot information.
This would help both the single-call use case (we don't sort everything)
and the multiple-call use case (we don't redo on subsequent calls work
already done in the first calls).
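To make the idea concrete, here is a rough sketch (illustrative only,
not the actual patch; class and member names are made up): a quickselect
whose private copy keeps, across calls, every pivot index already moved
to its final sorted position, so a later call only has to partition the
segment between two settled pivots.

    import java.util.BitSet;

    public class CachingSelector {

        private final double[] work;  // private copy we are free to reorder
        private final BitSet pivots;  // indices already in final sorted position

        public CachingSelector(final double[] values) {
            this.work = values.clone();
            this.pivots = new BitSet(values.length);
        }

        /** Returns the element of rank k (0-based) in ascending order. */
        public double select(final int k) {
            // Narrow [begin, end) using pivots placed by earlier calls.
            int begin = pivots.previousSetBit(k); // last settled pivot <= k
            int end = pivots.nextSetBit(k);       // first settled pivot >= k
            if (begin == k || end == k) {
                return work[k];                   // k itself is already settled
            }
            begin = (begin < 0) ? 0 : begin + 1;
            end = (end < 0) ? work.length : end;
            while (end - begin > 1) {
                final int p = partition(begin, end);
                pivots.set(p);                    // remember for later calls
                if (p == k) {
                    return work[k];
                } else if (p > k) {
                    end = p;                      // rank k is left of the pivot
                } else {
                    begin = p + 1;                // rank k is right of the pivot
                }
            }
            pivots.set(k);                        // single-element segment is settled
            return work[k];
        }

        /** Partitions work[begin..end) around work[begin]; returns the pivot's index. */
        private int partition(final int begin, final int end) {
            final double pivot = work[begin];
            int i = begin;
            for (int j = begin + 1; j < end; j++) {
                if (work[j] < pivot) {
                    i++;
                    swap(i, j);
                }
            }
            swap(begin, i);
            return i;
        }

        private void swap(final int a, final int b) {
            final double t = work[a];
            work[a] = work[b];
            work[b] = t;
        }
    }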
The current API of the Percentile class is derived from
AbstractUnivariateStatistic and provides several evaluate methods that
take the data array as a parameter.
This is compatible with the one-call use case without any semantic
change, but does not fit the multiple-call use case seamlessly.
If we want to cache some work already done on the array between calls
with different values of p (the percentile value), we have to check if
the array is still the same as the cached one on which we have done some
partitioning and saved some pivots. This can be done by storing a
reference to the original array in addition to a copy of its content
(which we reorganize) and with an "if (values == originalValues)"
statement. This would however not protect against changes the user
makes to the array elements themselves between calls. A better
protection would be to compute a hash code of the array content and
compare it to the original one at each call. However, this adds a
linear cost to every call.
The user who encountered the performance problems added an addValues
method to set up the array once, and simply ignored the values passed as
an argument to the evaluate methods. This breaks the semantics of the
API.
So I see four options for caching the partitioning work and would like
people to express which one they prefer. A clearCache or similar public
method would be made available to users to ensure clean computation if
the caching mechanism gets in their way.
1) do nothing to check the array is the same between calls and blindly
assume it IS the same. Users would really need to call clearCache
when they provide a new array
pros: very simple
cons: error-prone for the user, as it relies only on users reading and
understanding documentation that would change with new versions
2) check only array reference equality (with ==) to decide whether the
array is the same as in the previous call. Users would need to call
clearCache only if they reuse the same array but have changed its
content.
pros: trade-off between efficiency and reliability,
handles most cases properly
cons: may be wrong in corner cases
3) check the array content using a hash code. Users would generally not
need to call clearCache at all, so it would not be provided
pros: works in all cases
cons: adds a linear cost for array checking
4) remove the double[] values parameter from the API and use a separate
addValues method; the class would then reset its cache only when
addValues is called
pros: works in all cases
cons: Percentile could not implement UnivariateStatistic anymore
My preference is choice 2.
What do you think ?
Luc
Partition-based algorithms are O(n) in expected time but not in
worst-case time. If end users ask for it later, we could improve again
by adding Musser's introselect, which guarantees O(n) behavior even in
the worst case.
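Roughly, the shape would be as follows (a sketch only: Musser's actual
fallback is median-of-medians selection, which preserves the O(n) worst
case; for brevity this one falls back to sorting the remaining segment,
which only guarantees O(n log n)):

    import java.util.Arrays;

    public final class IntroSelect {

        /** Returns the element of rank k (0-based, ascending) of values. */
        public static double select(final double[] values, final int k) {
            final double[] work = values.clone();
            int begin = 0;
            int end = work.length;
            // Depth budget ~ 2*log2(n), as in introsort.
            int budget = 2 * (32 - Integer.numberOfLeadingZeros(work.length));
            while (end - begin > 1) {
                if (budget-- == 0) {
                    Arrays.sort(work, begin, end); // guaranteed fallback
                    return work[k];
                }
                final int p = partition(work, begin, end);
                if (p == k) {
                    return work[k];
                } else if (p > k) {
                    end = p;       // rank k is left of the pivot
                } else {
                    begin = p + 1; // rank k is right of the pivot
                }
            }
            return work[k];
        }

        /** Partitions a[begin..end) around a[begin]; returns the pivot's final index. */
        private static int partition(final double[] a, final int begin, final int end) {
            final double pivot = a[begin];
            int i = begin;
            for (int j = begin + 1; j < end; j++) {
                if (a[j] < pivot) {
                    i++;
                    final double t = a[i]; a[i] = a[j]; a[j] = t;
                }
            }
            final double t = a[begin]; a[begin] = a[i]; a[i] = t;
            return i;
        }

        private IntroSelect() {
        }
    }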
Another possibility would be to have an additional method that would
compute percentiles for several thresholds at once, but this would
require returning an array of results and I think it would be difficult
to mix with other statistics (mainly the DescriptiveStatistics.apply
method, which is used by the users I talked to).
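For illustration, such a bulk method could look like this (a sketch
using a sort-once approach for clarity; a partition-based version would
reuse the cached pivots instead, and the interpolation below only
loosely mimics Percentile's default estimation):

    import java.util.Arrays;

    /** Sketch of a bulk API: one call, several percentile thresholds. */
    public static double[] evaluate(final double[] values, final double[] ps) {
        final double[] sorted = values.clone();
        Arrays.sort(sorted);                         // O(n log n), once
        final int n = sorted.length;
        final double[] results = new double[ps.length];
        for (int i = 0; i < ps.length; i++) {
            final double pos = ps[i] * (n + 1) / 100.0; // estimated position
            final int fpos = (int) Math.floor(pos);
            if (fpos < 1) {
                results[i] = sorted[0];              // below the first point
            } else if (fpos >= n) {
                results[i] = sorted[n - 1];          // beyond the last point
            } else {
                final double d = pos - fpos;         // linear interpolation
                results[i] = sorted[fpos - 1] + d * (sorted[fpos] - sorted[fpos - 1]);
            }
        }
        return results;
    }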
Would this meet everyone's concerns?
Luc
On 18/09/2010 21:50, Dimitri Pourbaix wrote:
Ted,
O(n), not n
Expected case is n + n/2 + n/4 + ... < 2n
Yes, that I am already more inclined to buy.
Dim.
----------------------------------------------------------------------------
Dimitri Pourbaix *
Institut d'Astronomie et d'Astrophysique * Don't worry, be happy
CP 226, office 2.N4.211, building NO * and CARPE DIEM.
Universite Libre de Bruxelles *
Boulevard du Triomphe * Tel : +32-2-650.35.71
B-1050 Bruxelles * Fax : +32-2-650.42.26
http://sb9.astro.ulb.ac.be/~pourbaix * mailto:pourb...@astro.ulb.ac.be