Re: [math] correlation analysis with NaNs

Thomas Neidhart Thu, 08 Nov 2012 08:01:24 -0800

On 11/08/2012 02:01 PM, Sébastien Brisard wrote:
> Hi,
> 
> 2012/11/8 Gilles Sadowski <gil...@harfang.homelinux.org>:
>> On Thu, Nov 08, 2012 at 09:39:00AM +0100, Thomas Neidhart wrote:
>>> Hi Patrick,
>>>
>>> On 11/07/2012 04:37 PM, Patrick Meyer wrote:
>>>> I agree that it would be nice to have a constructor that allows you to
>>>> specify the ranking algorithm only.
>>>>
>>>> As far as NaN and the Spearman correlation, maybe we should add a default
>>>> strategy of NaNStrategy.FAIL so that an exception would occur if any NaN is
>>>> encountered. R uses this treatment of missing data and forces users to
>>>> choose how to handle it. If we implemented something like listwise or
>>>> pairwise deletion it could be used in other classes too. As such, treatment
>>>> of missing data should be part of a larger discussion and handled in a more
>>>> comprehensive and systematic way.
>>>
>>> I think this additional option makes sense, but I forward this
>>> discussion to the dev mailing list where it is better suited.
>>
>> I'm wary of having CM handle "missing" data.
>> For one thing we'd have to define a "convention" to represent missing data.
>> There is no good way to do that in Java. Using NaN for this purpose in a
>> low-level library is not a good idea IMHO.
>>
> I agree with Gilles, here. If I remember correctly, R has a special
> value NA, or something similar, which differs from NaN.
>>
>> Then, any convention might not be
>> suitable for some user applications, which would lead such an application's
>> developer to filter the data anyway in order to change his representation to
>> CM's representation. Rather than calling two redundant filtering codes, I'd
>> rather assume that CM gets a clean input on which its algorithm can operate.
>> As usual, the input is subjected to precondition checks, and exceptions are
>> thrown if the data is not clean enough.
>>
>> In summary: data validation (in the sense of discarding input) should not be
>> done _before_ calling CM routines.
>>
> +1.


OK, now I am confused. First you say that CM should not be involved in
data cleaning, but then you state that data validation should not be
done before calling CM? Maybe there is one *not* too many?

I think Patrick's proposal was to do exactly that: throw an
exception if such invalid data is encountered (NaNStrategy.FAIL).
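
A minimal sketch of such a fail-fast precondition check, for illustration
only. The class and method names (NanCheckSketch, requireNoNaN) are
hypothetical and are not part of the CM API; the sketch only assumes that
"fail" means rejecting the input with an exception as soon as a NaN is seen.

```java
/** Hypothetical illustration of a fail-fast NaN check; not CM code. */
final class NanCheckSketch {

    private NanCheckSketch() {
        // Static helper only, no instances.
    }

    /**
     * Throws if any of the given arrays contains a NaN.
     * The input is never modified or filtered.
     */
    static void requireNoNaN(double[]... arrays) {
        for (int a = 0; a < arrays.length; a++) {
            for (int i = 0; i < arrays[a].length; i++) {
                if (Double.isNaN(arrays[a][i])) {
                    throw new IllegalArgumentException(
                        "NaN found in array " + a + " at index " + i);
                }
            }
        }
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0};
        double[] y = {2.0, Double.NaN, 6.0};
        try {
            requireNoNaN(x, y);
        } catch (IllegalArgumentException e) {
            // The caller decides how to clean the data; the check only reports.
            System.out.println("Rejected input: " + e.getMessage());
        }
    }
}
```

With a check like this the library does not have to define any convention
for "missing" values: it only refuses input it cannot handle and leaves the
cleaning to the caller.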

The other thing is that NaNStrategy.REMOVED is broken, so either we
fix it or deprecate it.

Thomas

