Re: [math] use case for NaNStrategy.FIXED?

Luc Maisonobe Mon, 30 Jun 2014 12:34:01 -0700

Le 30/06/2014 19:15, Phil Steitz a écrit :
> 
> 
>> On Jun 30, 2014, at 9:59 AM, Gilles <gil...@harfang.homelinux.org>
>> wrote:
>> 
>> On Mon, 30 Jun 2014 09:08:00 -0700, Phil Steitz wrote:
>>>> On Jun 29, 2014, at 3:41 PM, Gilles
>>>> <gil...@harfang.homelinux.org> wrote:
>>>> 
>>>>>> On Sun, 29 Jun 2014 14:39:51 -0700, Phil Steitz wrote:
>>>>>>> On 6/29/14, 2:30 PM, Gilles wrote:
>>>>>>>> On Sun, 29 Jun 2014 10:25:58 -0700, Phil Steitz wrote: 
>>>>>>>> On 6/29/14, 9:48 AM, venkatesha murthy wrote: On Sun,
>>>>>>>> Jun 29, 2014 at 1:25 AM, Phil Steitz 
>>>>>>>> <phil.ste...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>>> On 6/25/14, 12:12 AM, Luc Maisonobe wrote: Le
>>>>>>>>>> 25/06/2014 06:05, venkatesha murthy a écrit :
>>>>>>>>>>> On Wed, Jun 25, 2014 at 12:21 AM, Luc Maisonobe 
>>>>>>>>>>> <l...@spaceroots.org>
>>>>>>>>> wrote:
>>>>>>>>>>>> Hi Venkat,
>>>>>>>>>>>> 
>>>>>>>>>>>> Le 23/06/2014 21:08, venkatesha murthy a écrit
>>>>>>>>>>>> :
>>>>>>>>>>>>> On Tue, Jun 24, 2014 at 12:08 AM, Luc
>>>>>>>>>>>>> Maisonobe <l...@spaceroots.org>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> While looking further in Percentile class
>>>>>>>>>>>>>> for MATH-1120, I have found another problem
>>>>>>>>>>>>>> in the current implementation. 
>>>>>>>>>>>>>> NaNStrategy.FIXED
>>>>>>>>> should
>>>>>>>>>>>>>> leave the NaNs in place, but at the end of
>>>>>>>>>>>>>> the KthSelector.select method, a call to
>>>>>>>>>>>>>> Arrays.sort moves the NaNs to the end of 
>>>>>>>>>>>>>> the small sub-array. What is really
>>>>>>>>>>>>>> implemented here is therefore closer to 
>>>>>>>>>>>>>> NaNStrategy.MAXIMAL than NaNStrategy.FIXED.
>>>>>>>>>>>>>> This always occur in the test cases because
>>>>>>>>>>>>>> they use very short arrays, and we 
>>>>>>>>>>>>>> directly
>>>>>>>>> switch to
>>>>>>>>>>>>>> this part of the select method.
>>>>>>>>>>>>> Are NaNs considered higher than +Inf ? If
>>>>>>>>>>>>> MAXIMAL represent replacing for +inf ; you
>>>>>>>>>>>>> need something to indicate beyond this for
>>>>>>>>>>>>> NaN.
>>>>>>>>>>>> Well, we can just keep the NaN themselves and
>>>>>>>>>>>> handled them appropriately, hoping not to trash
>>>>>>>>>>>> performances too much.
>>>>>>>>>>> Agreed.
>>>>>>>>>>> 
>>>>>>>>>>>>> What is the test input you see an issue and
>>>>>>>>>>>>> what is the actual error you are seeing.
>>>>>>>>>>>>> Please share the test case.
>>>>>>>>>>>> Just look at
>>>>>>>>>>>> PercentileTest.testReplaceNanInRange(). The 
>>>>>>>>>>>> first check in the test corresponds to a
>>>>>>>>>>>> Percentile configuration at 50% percentile, and
>>>>>>>>>>>> NaNStrategy.FIXED. The array has an odd number
>>>>>>>>>>>> of entries, so the 50% percentile is exactly
>>>>>>>>>>>> one of the entries: the one at index 5 in the 
>>>>>>>>>>>> final array.
>>>>>>>>>>>> 
>>>>>>>>>>>> The initial ordering is { 0, 1, 2, 3, 4, NaN,
>>>>>>>>>>>> NaN, 5, 7, NaN, 8 }. So for the
>>>>>>>>>>>> NaNStrategy.FIXED setting, it should not be
>>>>>>>>>>>> modified at all in the selection algorithm and
>>>>>>>>>>>> the result for 50% should be the first NaN of
>>>>>>>>>>>> the array, at index 5. In fact, due to the
>>>>>>>>>>>> Arrays.sort, we *do* reorder the array into  {
>>>>>>>>>>>> 0, 1, 2, 3, 4, 5, 7, 8, NaN, NaN, NaN }, so the
>>>>>>>>>>>> result is 5.
>>>>>>>>>>>> 
>>>>>>>>>>>> Agreed. just verified by putting back the
>>>>>>>>>>>> earlier insertionSort
>>>>>>>>> function.
>>>>>>>>>>>> If we use NaNStrategy.MAXIMAL and any quantile
>>>>>>>>>>>> above 67%, we get as a result
>>>>>>>>>>>> Double.POSITIVE_INFINITY instead of
>>>>>>>>>>>> Double.NaN.
>>>>>>>>>>>> 
>>>>>>>>>>>> If we agree to leave FIXED as unchanged
>>>>>>>>>>>> behaviour with your
>>>>>>>>> insertionSort
>>>>>>>>>>> code; then atleast MAXIMAL/MINIMAL should be
>>>>>>>>>>> allowed for transformation
>>>>>>>>> of
>>>>>>>>>>> NaN to +/-Inf
>>>>>>>>>> I'm fine with it, as long as we clearly state it in
>>>>>>>>>> the NaNStrategy documentation, saying somethig
>>>>>>>>>> like:
>>>>>>>>>> 
>>>>>>>>>> When MAXIMAL/MINIMAL are used, the NaNs are
>>>>>>>>>> effecively *replaced* by +/- infinity, hence the
>>>>>>>>>> results will be computed as if infinities were 
>>>>>>>>>> there in the first place.
>>>>>>>>> -1 - this is mixing concerns.  NaNStrategy exists for
>>>>>>>>> one specific purpose - to turn extend the ordering on
>>>>>>>>> doubles a total ordering. Strategies prescribe only
>>>>>>>>> how NaNs are to be treated in the ordering.  Side
>>>>>>>>> effects on the input array don't make sense in this 
>>>>>>>>> context.  The use case for which this class was
>>>>>>>>> created was rank transformations.  Returning
>>>>>>>>> infinities in rank transformations would blow up
>>>>>>>>> computations in these classes.  If a floating point 
>>>>>>>>> computations need to transform NaNs, the specific
>>>>>>>>> stats / other classes that do this transformation
>>>>>>>>> should perform and document the behavior.
>>>>>>>>> 
>>>>>>>>> Phil
>>>>>>>> OK. Agreed realized it later.  Hence i have not
>>>>>>>> touched NaNStrategy in my patch(MATH-1132) . Now i have
>>>>>>>> added a separate transformer to transform NaNs but it
>>>>>>>> uses NanStrategy.  Please let me know on this as this 
>>>>>>>> trasnformation itself can be used in different places.
>>>>>>> 
>>>>>>> Honestly, I don't see the value of this.  I would be more
>>>>>>> favorable to an addition to MathArrays supporting NaN (or
>>>>>>> infinity) filtering / transformation.  Several years back
>>>>>>> we made the decision to just let NaNs propagate in
>>>>>>> computations, which basically means that if you feed NaNs
>>>>>>> to [math] you are going to get NaNs out in results. If we
>>>>>>> do get into the business of filtering NaNs (or infinities
>>>>>>> for that matter), I think it is probably best to do that
>>>>>>> in MathArrays or somesuch - i.e., don't decorate APIs
>>>>>>> with NaN or infinity-handling handling descriptions.
>>>>>>> That would be a huge task and would likely end up
>>>>>>> bleeding implementation detail into the APIs.  As I said
>>>>>>> above, NaNStrategy has to exist for rank transformations
>>>>>>> to be well-defined.  NaN-handling in floating point 
>>>>>>> computations is defined and as long as we fully document
>>>>>>> what our algorithms do, I think it is fair enough to "let
>>>>>>> the NaNs fall where they may" - which basically means
>>>>>>> users need to do the filtering / transformation
>>>>>>> themselves.
>>>>>> 
>>>>>> So, do I understand correctly that it's OK to add the
>>>>>> "remove" and "replace" utility methods (cf. MATH-1130) in
>>>>>> "MathArrays"? Or does this thread conclude that we don't
>>>>>> have a use for this within Commons Math?
>>>>> 
>>>>> You raise a good point here.  Personally, I don't see an
>>>>> internal use and if we are sticking with MathArrays including
>>>>> only things we have internal use for, I would say no, don't
>>>>> include them.
>>>> 
>>>> A third way to look at it would be that since some algorithms
>>>> require being passed "clean" data (e.g. without NaNs), we could
>>>> provide some simple means to do so. A user could then write,
>>>> e.g. --- double[] data = { 1, Double.NaN, 3, 6, 5 };
>>>> 
>>>> Percentile p = new Percentile(); double q =
>>>> p.evaluate(MathArrays.replace(data, Double.NaN,
>>>> Double.POSITIVE_INFINITY), 0.1); ---
>>>> 
>>>> Then, if we provide that utility, is it still necessary to
>>>> complicate the Percentile code with a NaNStrategy parameter?
>>> 
>>> I may be missing something, but it seems to me that percentile
>>> does need a NaNStrategy in the presence if NaNs like the other
>>> order statistics do (basically to define their position In the
>>> ordering).  It doesn't logically need more than that.  I would
>>> be ok defaulting the strategy or even returning NaN in the
>>> presence of NaNs (or whenever interpolation involves a NaN). All
>>> of this can be documented.
>> 
>> I am _surely_ missing something because I don't figure out how NaNs
>> can be useful when computing those statistics... :-}
> 
> Good point.  Basically the different strategies allow you to
> effectively ignore or work around them so stats can be well-defined.
> Without NaNStrategy, order statistics including NaNs are undefined.


The problem with Percentile is that the elements are both reordered
*and* used when interpolating between two values. Tor the rank methods,
only reordering was needed so the various NaNStrategies could be
implemented by replacing at will with +/- infinity and a classical sort
would do the ordering for us.

Here, we do interpolate on the values, so if we first replace NaN with
+infinity to put them at the end, we may interpolate between a finite
value and +inf and have +inf as a result, whereas if the NaN "value" was
preserved and some magic used to sort the NaN, we would end up
interpolating with a NaN and get a NaN as a result. Setting up the magic
for MINIMAL and MAXIMAL seems achievable withing the sort, as we already
have a specific sort (in fact not a sort but a selection algorithm,
which is the core of the class). Setting up the magic for FIXED seems
more difficult.

So I guess implementing NaNStrategies may be done in different ways
depending on how they will be used by the calling algorithm. I don't yet
know if we can set up a general method to be used for all statistics.
Looking only at Percentile, replacement with +/- infinity already seems
to not be an appropriate solution.

best regards,
Luc

> 
> 
> Phil
> 
>> 
>>> I still don't see a within-math use for recoding methods; though
>>> I do see the value from a user perspective.  The problem is if we
>>> open this door, we risk MathArrays becoming a mini commons lang.
>> 
>> If Commons Lang already provides a feature, then we indeed don't
>> need to offer the same within CM (if there is no internal use). Can
>> someone confirm?
>> 
>> 
>> Otherwise we could define a new inner interface in MathArrays,
>> similar to "MathArrays.Function": --- public class MathArrays {
>> 
>> // ...
>> 
>> public interface Transformer { /** Transform the whole array */ 
>> double[] transform(double[] input); /** Transform the array in the
>> interval [from, to) and return the whole array */ double[]
>> transform(double[] input, int from, int to); /** Transform the
>> array in the interval [from, to) and return the transformed
>> sub-array */ double[] transformAndSplice(double[] input, int from,
>> int to); } } ---
>> 
>> Then we can provide "replace" and "remove" transformers through
>> factory methods (example/untested code): --- public class
>> MathArrays {
>> 
>> // ...
>> 
>> public static Transformer createReplaceTransformer(final double
>> old, final double replace) { return new Transformer() { public
>> double[] transform(double[] input) { final int len = input.length; 
>> final double[] output = new double[len]; for (int i = 0; i < len;
>> i++) { output[i] = Precision.equalsIncludingNaN(input[i], old) ? 
>> replace : input[i]; } return output; }
>> 
>> // etc. }; } } ---
>> 
>> 
>> Gilles
>> 
>> 
>> ---------------------------------------------------------------------
>>
>> 
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>> For additional commands, e-mail: dev-h...@commons.apache.org
>> 
> 
> ---------------------------------------------------------------------
>
> 
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

Re: [math] use case for NaNStrategy.FIXED?

Reply via email to