On Wed, 18 Jun 2014 16:39:12 +0200, Luc Maisonobe wrote:
Hi Gilles and Venkat,

Le 18/06/2014 15:40, Gilles a écrit :
On Wed, 18 Jun 2014 15:02:41 +0200, Luc Maisonobe wrote:
Le 18/06/2014 14:32, Gilles a écrit :
Hello Luc.


See
  https://issues.apache.org/jira/browse/MATH-1129

The problem reported was due to the sorting procedure not behaving
correctly in the presence of NaN.
This procedure could be replaced by an equivalent method from the JDK:
  java.util.Arrays.sort(double[],int,int)

Any objection?

If you imply removing the select, medianOf3 and partition methods,
yes I
would object. If you imply replacing only the insertionSort method used
for small sub-arrays, then I almost agree.

Issue 1129 concerns the latter: See the comment in "insertionSort" in
the
current code.

However, for the former, you should really have a look at MATH-1120
  https://issues.apache.org/jira/browse/MATH-1120
The proposed patch indeed touches those methods.

So I think it is worth fixing MATH-1120 first, with the NaNStrategy and
go back to MATH-1129 afterwards, to see if it still applies of if
MATH-1120 also fixed it.

As per my last comment on MATH-1120, the "NaNStrategy" part of the patch
is an extension of the initial feature request.

Sure, but it is a really good addition and seems fair to consider.


As per the Javadoc, the CM code was originally meant to handle (in some way), the presence of NaN values within the data. So I thought that this
should be fixed before extending the API (which entailed additional
questions as the proposed patch is fairly "massive").

Yes, the patch is a big one, and it has been works on for a while.
As it has quite stabilized by now, I think it would be the good time
to include it (with some editing I could do) and to start working with
separate patches applied on top of this one.

If we let it out of the code longer, it will become more and more
difficult for the contributor to work on it and to us to include it.


Especially, the new code is clearly untested for performance, and this
seems to be an important argument by your previous post.

Yes, it is important as some issues were raised on it (if you remember well, they were raised by one researcher from another lab working on the
same big project as you, just after the symposium when we met).


So, it is possible to add "isNaN" conditional, as I did in "insertionSort"
and fix the current code without additional impact.

I really consider it as an ugly hack rather than attempting to really
fix the mess I did in this class.


Or, we redesign (as per MATH-1120) which implies forgetting about
performance for now. The patch removes "insertionSort" altogether!

Well, it is just one line calling
  Arrays.sort(work, begin, end);
I can put the insertion sort back here while editing the patch before
committing it.

IMHO, it is too many changes at once...

You know the popular saying about « premature optimization ». I think it would be fair to temporarily forget about performance, fix the issue and
set up a new ground to work on based on MATH-1120 patch, then improve
this by small corrections once the first huge patch has been committed,
and then look back at performances.

That's fine with me. I was really waiting from someone else to apply
this patch. ;-)

Gilles




What is the meaning of percentile when NaNs are kept in the data
(strategy "FIXED")?
For all the other strategies, the NaNs will not cause a problem
within the algorithms being discussed here (being replaced by +inf
or -inf, or removed, or raising an exception).

This is one point we should really discuss, but it can be done later.
I agree with you, FIXED is not appropriate here. We could simply raise an exception if it is selected as not being meaningful for this algorithm.




I will have a look at the proposed patch, including your comments.

Thanks.

So I did have a look and really think it is worth applying it. I agree with your comments and will include your change for the NaNs recoding.
This change alone would solve MATH-1129 by the way.

I also am quite puzzled with the various default strategies and would
propose only one. I would even suggest to select immediately R8 instead
of our legacy method, as it would nevertheless be an improvement and
would not break existing code (only provide better results).

As there are several customization area (method, pivoting NaN strategy), it would also be interesting to use the builder approach we started in
optimization. This would also allow us to pass an argument for the
RANDOM pivoting where the user could select a specific random generator
with specific seed, I really do not like having a random generator
blindly created with the automatic time dependent seed.

Another point I already changed is the medianOf3 method throwing an
exception in addition to being deprecated. It is really not friendly to
users. So I replaced the exception by a call to the appropriate enum
method (but of course let the deprecation in place).

I also changed the pivoting strategy to be an instance field rather than
a class field, just as all the other customization parameters.

best regards,
Luc


Gilles

[...]


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

Reply via email to