On Thu, Nov 08, 2012 at 05:00:52PM +0100, Thomas Neidhart wrote:
> On 11/08/2012 02:01 PM, Sébastien Brisard wrote:
> > Hi,
> >
> > 2012/11/8 Gilles Sadowski <gil...@harfang.homelinux.org>:
> >> On Thu, Nov 08, 2012 at 09:39:00AM +0100, Thomas Neidhart wrote:
> >>> Hi Patrick,
> >>>
> >>> On 11/07/2012 04:37 PM, Patrick Meyer wrote:
> >>>> I agree that it would be nice to have a constructor that allows you
> >>>> to specify the ranking algorithm only.
> >>>>
> >>>> As far as NaN and the Spearman correlation, maybe we should add a
> >>>> default strategy of NaNStrategy.FAIL so that an exception would occur
> >>>> if any NaN is encountered. R uses this treatment of missing data and
> >>>> forces users to choose how to handle it. If we implemented something
> >>>> like listwise or pairwise deletion it could be used in other classes
> >>>> too. As such, treatment of missing data should be part of a larger
> >>>> discussion and handled in a more comprehensive and systematic way.
> >>>
> >>> I think this additional option makes sense, but I forward this
> >>> discussion to the dev mailing list where it is better suited.
> >>
> >> I'm wary of having CM handle "missing" data.
> >> For one thing we'd have to define a "convention" to represent missing
> >> data. There is no good way to do that in Java. Using NaN for this
> >> purpose in a low-level library is not a good idea IMHO.
> >>
> > I agree with Gilles, here. If I remember correctly, R has a special
> > value NA, or something similar, which differs from NaN.
> >>
> >> Then, any convention might not be suitable for some user applications,
> >> which would lead such an application's developer to filter the data
> >> anyway in order to change his representation to CM's representation.
> >> Rather than calling two redundant filtering codes, I'd rather assume
> >> that CM gets a clean input on which its algorithm can operate.
> >> As usual, the input is subjected to precondition checks, and exceptions
> >> are thrown if the data is not clean enough.
> >>
> >> In summary: data validation (in the sense of discarding input) should
> >> not be done _before_ calling CM routines.
> >>
> > +1.
>
> ok, I am now confused. First you say that CM should not be involved in
> data cleaning, but then you state that data validation should not be
> done before calling CM? Maybe there is one *not* too many?

Yes, you are right: I wrote the opposite of what I meant.
---
In summary: data validation (in the sense of discarding input) should be
done _before_ calling CM routines.
---

>
> I think the proposition from Patrick was to do exactly that: throw an
> exception if such invalid data is encountered (NaNStrategy.FAIL).
>
> The other thing is that NaNStrategy.REMOVED is broken, so either we
> fix it or deprecate it.

+1
[I mean (I think): If people rely on CM's removal of NaNs, we could fix it.
However, if nobody could actually rely on this feature because it is broken,
I'd prefer to remove it.]

Sorry for the confusion,
Gilles
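
To make the corrected summary concrete, here is a minimal sketch of what "validation before calling CM" could look like on the caller's side. Everything in it is an assumption for illustration: the class `NaNPreFilter` and its methods `requireNoNaN` and `removeNaNPairs` are hypothetical names and are not part of Commons Math. The first method mimics a fail-fast precondition check (the behaviour proposed for NaNStrategy.FAIL); the second does the listwise deletion that Patrick mentions, as a step the application runs itself before handing the arrays to a correlation routine.

```java
// Hypothetical caller-side helper, NOT part of Commons Math.
// It cleans paired samples before they are passed to a library routine.
final class NaNPreFilter {

    private NaNPreFilter() {
        // Static utility class, no instances.
    }

    /** Fail-fast check: throws if either array contains a NaN. */
    static void requireNoNaN(double[] x, double[] y) {
        requireSameLength(x, y);
        for (int i = 0; i < x.length; i++) {
            if (Double.isNaN(x[i]) || Double.isNaN(y[i])) {
                throw new IllegalArgumentException("NaN found at index " + i);
            }
        }
    }

    /**
     * Listwise deletion: drops every pair (x[i], y[i]) in which either
     * value is NaN. Returns {cleanedX, cleanedY}; the inputs are untouched.
     */
    static double[][] removeNaNPairs(double[] x, double[] y) {
        requireSameLength(x, y);

        // First pass: count the pairs that survive.
        int kept = 0;
        for (int i = 0; i < x.length; i++) {
            if (!Double.isNaN(x[i]) && !Double.isNaN(y[i])) {
                kept++;
            }
        }

        // Second pass: copy the surviving pairs, preserving their order.
        double[] cleanedX = new double[kept];
        double[] cleanedY = new double[kept];
        int j = 0;
        for (int i = 0; i < x.length; i++) {
            if (!Double.isNaN(x[i]) && !Double.isNaN(y[i])) {
                cleanedX[j] = x[i];
                cleanedY[j] = y[i];
                j++;
            }
        }
        return new double[][] {cleanedX, cleanedY};
    }

    private static void requireSameLength(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException(
                "Arrays differ in length: " + x.length + " vs " + y.length);
        }
    }
}
```

An application would call `removeNaNPairs` (or whatever filtering matches its own convention for missing data) and pass the two cleaned arrays on, so that the library only ever sees clean input and its precondition checks never fire.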