On Sun, Nov 18, 2012 at 11:01:18PM +0100, Thomas Neidhart wrote: > On 11/09/2012 11:14 PM, Phil Steitz wrote: > > On 11/9/12 12:18 AM, Thomas Neidhart wrote: > >> On Thu, Nov 8, 2012 at 7:21 PM, Phil Steitz <phil.ste...@gmail.com> wrote: > >> > >>> On 11/8/12 9:44 AM, Phil Steitz wrote: > >>>> On 11/8/12 8:23 AM, Gilles Sadowski wrote: > >>>>> On Thu, Nov 08, 2012 at 05:00:52PM +0100, Thomas Neidhart wrote: > >>>>>> On 11/08/2012 02:01 PM, Sébastien Brisard wrote: > >>>>>>> Hi, > >>>>>>> > >>>>>>> 2012/11/8 Gilles Sadowski <gil...@harfang.homelinux.org>: > >>>>>>>> On Thu, Nov 08, 2012 at 09:39:00AM +0100, Thomas Neidhart wrote: > >>>>>>>>> Hi Patrick, > >>>>>>>>> > >>>>>>>>> On 11/07/2012 04:37 PM, Patrick Meyer wrote: > >>>>>>>>>> I agree that it would be nice to have a constructor that allows > >>> you to > >>>>>>>>>> specific the ranking algorithm only. > >>>> +1 - patches welcome. > >>>>>>>>>> As far as NaN and the Spearman correlation, maybe we should add a > >>> default > >>>>>>>>>> strategy of NaNStrategy.FAIL so that an exception would occur if > >>> any NaN is > >>>>>>>>>> encountered. R uses this treatment of missing data and forces > >>> users to > >>>>>>>>>> choose how to handle it. If we implemented something like listwise > >>> or > >>>>>>>>>> pairwise deletion it could be used in other classes too. As such, > >>> treatment > >>>>>>>>>> of missing data should be part of a larger discussion and handled > >>> in a more > >>>>>>>>>> comprehensive and systematic way. > >>>> +1 to develop a strategy for representing how to represent and > >>>> handle missing data (see below) > >>>>>>>>> I think this additional option makes sense, but I forward this > >>>>>>>>> discussion to the dev mailing list where it is better suited. > >>>>>>>> I'm wary of having CM handle "missing" data. > >>>>>>>> For one thing we'd have to define a "convention" to represent > >>> missing data. > >>>>>>>> There is no good way to do that in Java. Using NaN for this purpose > >>> in a > >>>>>>>> low-level library is not a good idea IMHO. > >>>>>>>> > >>>>>>> I agree with Gilles, here. If I remember correctly, R has a special > >>>>>>> value NA, or something similar, which differs from NaN. > >>>>>>>> Then, any convention might not be > >>>>>>>> suitable for some user applications, which would lead such an > >>> application's > >>>>>>>> developer to filter the data anyway in order to change his > >>> representation to > >>>>>>>> CM's representation. Rather that calling two redundant filtering > >>> codes, I'd > >>>>>>>> rather assume that CM gets a clean input on which its algorithm can > >>> operate. > >>>>>>>> As usual, the input is subjected to precondition checks, and > >>> exceptions are > >>>>>>>> thrown if the data is not clean enough. > >>>>>>>> > >>>>>>>> In summary: data validation (in the sense of discarding input) > >>> should not be > >>>>>>>> done _before_ calling CM routines. > >>>>>>>> > >>>>>>> +1. > >>>>>> ok, I am now confused. First you say that CM should not be involved in > >>>>>> data cleaning, but then you state that data validation should not be > >>>>>> done before calling CM? May be there is a *not* too much? > >>>>> Yes, you are right: I wrote the opposite of what I meant. > >>>>> --- > >>>>> In summary: data validation (in the sense of discarding input) should > >>>>> be done _before_ calling CM routines. > >>>>> --- > >>>>> > >>>>>> I think the proposition from Patrick was to exactly do that: throw an > >>>>>> exception if such invalid data is encountered (NaNStrategy.FAIL). > >>>>>> > >>>>>> The other thing is, that the NaNStrategy.REMOVED is broken, so either > >>> we > >>>>>> fix is or deprecate it. > >>>> That we should fix. Please open a JIRA for this. I assume you are > >>>> talking about the implementation in NaturalRanking. > >>>>> +1 > >>>>> [I mean (I think): If people rely on CM's removal of NaNs, we could fix > >>> it. > >>>>> However, if nobody could actually rely on this feature because it is > >>> broken, > >>>>> I'd prefer to remove it.] > >>>> There are two issues here. One is specific to ranking algorithms. > >>>> To be well-defined, a RankingAlgorithm needs a NaNStrategy, since > >>>> the result has to be a total ordering. The NaNStrategy.REMOVED > >>>> strategy is intended to represent removal of NaNs from the data to > >>>> be ordered. If it is not implemented correctly in NaturalRanking or > >>>> other rankings that is a bug and needs to be fixed. > >>> Sorry, I just reread Patrick's original mail. IIUC, there is > >>> nothing wrong with the implementation of NaNStrategy.REMOVED in > >>> NaturalRanking or other implemented rankings. The problem is how > >>> the Spearman's impl handles it. That is indeed a bug in Spearman's > >>> impl that should be fixed. The correct fix is to throw out the > >>> corresponding entry in the second array when REMOVED is the > >>> configured NaNStrategy. I agree with Patrick that adding .FAIL and > >>> setting that as the default is a good idea. Patches welcome. > >>>> The second issue is the more general one of how to represent and > >>>> handle missing data. I have always seen that as a limitation that > >>>> we would eventually address on an algorithm by algorithm basis. > >>>> Different algorithms can be configured to do different things when > >>>> missing data are encountered. It is not always possible or > >>>> desirable to preprocess the data to "eliminate" or impute missing > >>>> data. Saying that we are just not going to deal with it is a > >>>> limitation that I don't think we should impose. I am would like to > >>>> hear others' ideas about good ways to model missing data in Java. > >> Hi Phil, > >> > >> ok I have created three new issues: > >> > >> * MATH-891 > >> * MATH-892 > >> * MATH-893 > > > > Thanks! > >> > >> Regarding the NaNStrategy.REMOVED, I think it will be necessary to adjust > >> the RankingAlgorithm interface a bit. Right now, it only takes as input a > >> one-dimensional array. But in case of correlations, you have two input > >> arrays. If you remove from one array the NaN values, you have no means to > >> know at which index they have been removed to do the same with the other > >> array. > > > > Or you push that responsibility to the client - in this case > > SpearmansCorrelation. My first thought on how to fix the > > Spearman's impl was to have it compare lengths of ranked / unranked > > when invoked with the REMOVED NaN strategy and then scan the > > original arrays when removals happen, adjusting the ranked arrays > > accordingly. > > I thought about this a bit more, and I do not think it can be done > safely on the client side (i.e. SpearmansCorrelation). > > Consider the following case: > > x: [NaN, 1, 2] > y: [1, NaN, 2] > > the ranking algorithm with a NaNStrategy of REMOVED would rank as follows: > > x: [1, 2] > y: [1, 2] > > on the client side, everything looks fine, but in fact we would > correlate wrong data. > > Additionally, on the client side, we have no means to know the actual > NaNStrategy that is used, as it is hidden in the ranking algorithm. > > Moreover, comparing with the original array may also not work, as the > ranking algorithm may change the data, so alignment is not always possible. > > >>> configured NaNStrategy. I agree with Patrick that adding .FAIL and > >>> setting that as the default is a good idea. Patches welcome. > > The NaNStrategy.FAILED has been added already, shall we make it the > default then, what do you think?
+1 Gilles --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org For additional commands, e-mail: dev-h...@commons.apache.org