Re: [Math] MATH-1129 and Re: [MATH-1120] Needed opinion about support on variations in, percentile calculation

Gilles Wed, 18 Jun 2014 08:08:15 -0700

On Wed, 18 Jun 2014 16:39:12 +0200, Luc Maisonobe wrote:

Hi Gilles and Venkat,
Le 18/06/2014 15:40, Gilles a écrit :
On Wed, 18 Jun 2014 15:02:41 +0200, Luc Maisonobe wrote:
Le 18/06/2014 14:32, Gilles a écrit :
Hello Luc.
See
  https://issues.apache.org/jira/browse/MATH-1129
The problem reported was due to the sorting procedure notbehaving
correctly in the presence of NaN.
This procedure could be replaced by an equivalent method fromthe JDK:
  java.util.Arrays.sort(double[],int,int)

Any objection?
If you imply removing the select, medianOf3 and partitionmethods,
yes I
would object. If you imply replacing only the insertionSortmethod used
for small sub-arrays, then I almost agree.
Issue 1129 concerns the latter: See the comment in "insertionSort"in
the
current code.
However, for the former, you should really have a look atMATH-1120
  https://issues.apache.org/jira/browse/MATH-1120
The proposed patch indeed touches those methods.
So I think it is worth fixing MATH-1120 first, with the NaNStrategyand
go back to MATH-1129 afterwards, to see if it still applies of if
MATH-1120 also fixed it.
As per my last comment on MATH-1120, the "NaNStrategy" part of thepatch
is an extension of the initial feature request.
Sure, but it is a really good addition and seems fair to consider.
As per the Javadoc, the CM code was originally meant to handle (insomeway), the presence of NaN values within the data. So I thought thatthis
should be fixed before extending the API (which entailed additional
questions as the proposed patch is fairly "massive").
Yes, the patch is a big one, and it has been works on for a while.
As it has quite stabilized by now, I think it would be the good time
to include it (with some editing I could do) and to start workingwith
separate patches applied on top of this one.

If we let it out of the code longer, it will become more and more
difficult for the contributor to work on it and to us to include it.
Especially, the new code is clearly untested for performance, andthis
seems to be an important argument by your previous post.
Yes, it is important as some issues were raised on it (if yourememberwell, they were raised by one researcher from another lab working onthe
same big project as you, just after the symposium when we met).
So, it is possible to add "isNaN" conditional, as I did in"insertionSort"
and fix the current code without additional impact.
I really consider it as an ugly hack rather than attempting to really
fix the mess I did in this class.
Or, we redesign (as per MATH-1120) which implies forgetting about
performance for now. The patch removes "insertionSort" altogether!
Well, it is just one line calling
  Arrays.sort(work, begin, end);
I can put the insertion sort back here while editing the patch before
committing it.
IMHO, it is too many changes at once...
You know the popular saying about « premature optimization ». I thinkitwould be fair to temporarily forget about performance, fix the issueand
set up a new ground to work on based on MATH-1120 patch, then improve
this by small corrections once the first huge patch has beencommitted,
and then look back at performances.


That's fine with me. I was really waiting from someone else to apply
this patch. ;-)

Gilles

What is the meaning of percentile when NaNs are kept in the data
(strategy "FIXED")?
For all the other strategies, the NaNs will not cause a problem
within the algorithms being discussed here (being replaced by +inf
or -inf, or removed, or raising an exception).
This is one point we should really discuss, but it can be done later.
I agree with you, FIXED is not appropriate here. We could simplyraisean exception if it is selected as not being meaningful for thisalgorithm.
I will have a look at the proposed patch, including your comments.
Thanks.
So I did have a look and really think it is worth applying it. Iagreewith your comments and will include your change for the NaNsrecoding.
This change alone would solve MATH-1129 by the way.

I also am quite puzzled with the various default strategies and would
propose only one. I would even suggest to select immediately R8instead
of our legacy method, as it would nevertheless be an improvement and
would not break existing code (only provide better results).
As there are several customization area (method, pivoting NaNstrategy),it would also be interesting to use the builder approach we startedin
optimization. This would also allow us to pass an argument for the
RANDOM pivoting where the user could select a specific randomgenerator
with specific seed, I really do not like having a random generator
blindly created with the automatic time dependent seed.

Another point I already changed is the medianOf3 method throwing an
exception in addition to being deprecated. It is really not friendlyto
users. So I replaced the exception by a call to the appropriate enum
method (but of course let the deprecation in place).
I also changed the pivoting strategy to be an instance field ratherthan
a class field, just as all the other customization parameters.

best regards,
Luc
Gilles
[...]



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

Re: [Math] MATH-1129 and Re: [MATH-1120] Needed opinion about support on variations in, percentile calculation

Reply via email to