Re: [math] correlation analysis with NaNs

Phil Steitz Fri, 09 Nov 2012 14:14:48 -0800

On 11/9/12 12:18 AM, Thomas Neidhart wrote:
> On Thu, Nov 8, 2012 at 7:21 PM, Phil Steitz <phil.ste...@gmail.com> wrote:
>
>> On 11/8/12 9:44 AM, Phil Steitz wrote:
>>> On 11/8/12 8:23 AM, Gilles Sadowski wrote:
>>>> On Thu, Nov 08, 2012 at 05:00:52PM +0100, Thomas Neidhart wrote:
>>>>> On 11/08/2012 02:01 PM, Sébastien Brisard wrote:
>>>>>> Hi,
>>>>>>
>>>>>> 2012/11/8 Gilles Sadowski <gil...@harfang.homelinux.org>:
>>>>>>> On Thu, Nov 08, 2012 at 09:39:00AM +0100, Thomas Neidhart wrote:
>>>>>>>> Hi Patrick,
>>>>>>>>
>>>>>>>> On 11/07/2012 04:37 PM, Patrick Meyer wrote:
>>>>>>>>> I agree that it would be nice to have a constructor that allows
>> you to
>>>>>>>>> specific the ranking algorithm only.
>>> +1 - patches welcome.
>>>>>>>>> As far as NaN and the Spearman correlation, maybe we should add a
>> default
>>>>>>>>> strategy of NaNStrategy.FAIL so that an exception would occur if
>> any NaN is
>>>>>>>>> encountered. R uses this treatment of missing data and forces
>> users to
>>>>>>>>> choose how to handle it. If we implemented something like listwise
>> or
>>>>>>>>> pairwise deletion it could be used in other classes too. As such,
>> treatment
>>>>>>>>> of missing data should be part of a larger discussion and handled
>> in a more
>>>>>>>>> comprehensive and systematic way.
>>> +1 to develop a strategy for representing how to represent and
>>> handle missing data (see below)
>>>>>>>> I think this additional option makes sense, but I forward this
>>>>>>>> discussion to the dev mailing list where it is better suited.
>>>>>>> I'm wary of having CM handle "missing" data.
>>>>>>> For one thing we'd have to define a "convention" to represent
>> missing data.
>>>>>>> There is no good way to do that in Java. Using NaN for this purpose
>> in a
>>>>>>> low-level library is not a good idea IMHO.
>>>>>>>
>>>>>> I agree with Gilles, here. If I remember correctly, R has a special
>>>>>> value NA, or something similar, which differs from NaN.
>>>>>>> Then, any convention might not be
>>>>>>> suitable for some user applications, which would lead such an
>> application's
>>>>>>> developer to filter the data anyway in order to change his
>> representation to
>>>>>>> CM's representation. Rather that calling two redundant filtering
>> codes, I'd
>>>>>>> rather assume that CM gets a clean input on which its algorithm can
>> operate.
>>>>>>> As usual, the input is subjected to precondition checks, and
>> exceptions are
>>>>>>> thrown if the data is not clean enough.
>>>>>>>
>>>>>>> In summary: data validation (in the sense of discarding input)
>> should not be
>>>>>>> done _before_ calling CM routines.
>>>>>>>
>>>>>> +1.
>>>>> ok, I am now confused. First you say that CM should not be involved in
>>>>> data cleaning, but then you state that data validation should not be
>>>>> done before calling CM? May be there is a *not* too much?
>>>> Yes, you are right: I wrote the opposite of what I meant.
>>>> ---
>>>>   In summary: data validation (in the sense of discarding input) should
>>>>   be done _before_ calling CM routines.
>>>> ---
>>>>
>>>>> I think the proposition from Patrick was to exactly do that: throw an
>>>>> exception if such invalid data is encountered (NaNStrategy.FAIL).
>>>>>
>>>>> The other thing is, that the NaNStrategy.REMOVED is broken, so either
>> we
>>>>> fix is or deprecate it.
>>> That we should fix.  Please open a JIRA for this.  I assume you are
>>> talking about the implementation in NaturalRanking.
>>>> +1
>>>> [I mean (I think): If people rely on CM's removal of NaNs, we could fix
>> it.
>>>> However, if nobody could actually rely on this feature because it is
>> broken,
>>>> I'd prefer to remove it.]
>>> There are two issues here.  One is specific to ranking algorithms.
>>> To be well-defined, a RankingAlgorithm needs a NaNStrategy, since
>>> the result has to be a total ordering.  The NaNStrategy.REMOVED
>>> strategy is intended to represent removal of NaNs from the data to
>>> be ordered.  If it is not implemented correctly in NaturalRanking or
>>> other rankings that is a bug and needs to be fixed.
>> Sorry, I just reread Patrick's original mail.  IIUC, there is
>> nothing wrong with the implementation of NaNStrategy.REMOVED in
>> NaturalRanking or other implemented rankings.  The problem is how
>> the Spearman's impl handles it.  That is indeed a bug in Spearman's
>> impl that should be fixed.  The correct fix is to throw out the
>> corresponding entry in the second array when REMOVED is the
>> configured NaNStrategy.  I agree with Patrick that adding .FAIL and
>> setting that as the default is a good idea.  Patches welcome.
>>> The second issue is the more general one of how to represent and
>>> handle missing data.  I have always seen that as a limitation that
>>> we would eventually address on an algorithm by algorithm basis.
>>> Different algorithms can be configured to do different things when
>>> missing data are encountered.  It is not always possible or
>>> desirable to preprocess the data to "eliminate" or impute missing
>>> data.  Saying that we are just not going to deal with it is a
>>> limitation that I don't think we should impose.  I am would like to
>>> hear others' ideas about good ways to model missing data in Java.
> Hi Phil,
>
> ok I have created three new issues:
>
>  * MATH-891
>  * MATH-892
>  * MATH-893


Thanks!
>
> Regarding the NaNStrategy.REMOVED, I think it will be necessary to adjust
> the RankingAlgorithm interface a bit. Right now, it only takes as input a
> one-dimensional array. But in case of correlations, you have two input
> arrays. If you remove from one array the NaN values, you have no means to
> know at which index they have been removed to do the same with the other
> array.

Or you push that responsibility to the client - in this case
SpearmansCorrelation.   My first thought on how to fix the
Spearman's impl was to have it compare lengths of ranked / unranked
when invoked with the REMOVED NaN strategy and then scan the
original arrays when removals happen, adjusting the ranked arrays
accordingly.  

Phil
>
> Thomas
>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

Re: [math] correlation analysis with NaNs

Reply via email to