On 11/9/12 12:18 AM, Thomas Neidhart wrote: > On Thu, Nov 8, 2012 at 7:21 PM, Phil Steitz <phil.ste...@gmail.com> wrote: > >> On 11/8/12 9:44 AM, Phil Steitz wrote: >>> On 11/8/12 8:23 AM, Gilles Sadowski wrote: >>>> On Thu, Nov 08, 2012 at 05:00:52PM +0100, Thomas Neidhart wrote: >>>>> On 11/08/2012 02:01 PM, Sébastien Brisard wrote: >>>>>> Hi, >>>>>> >>>>>> 2012/11/8 Gilles Sadowski <gil...@harfang.homelinux.org>: >>>>>>> On Thu, Nov 08, 2012 at 09:39:00AM +0100, Thomas Neidhart wrote: >>>>>>>> Hi Patrick, >>>>>>>> >>>>>>>> On 11/07/2012 04:37 PM, Patrick Meyer wrote: >>>>>>>>> I agree that it would be nice to have a constructor that allows >> you to >>>>>>>>> specific the ranking algorithm only. >>> +1 - patches welcome. >>>>>>>>> As far as NaN and the Spearman correlation, maybe we should add a >> default >>>>>>>>> strategy of NaNStrategy.FAIL so that an exception would occur if >> any NaN is >>>>>>>>> encountered. R uses this treatment of missing data and forces >> users to >>>>>>>>> choose how to handle it. If we implemented something like listwise >> or >>>>>>>>> pairwise deletion it could be used in other classes too. As such, >> treatment >>>>>>>>> of missing data should be part of a larger discussion and handled >> in a more >>>>>>>>> comprehensive and systematic way. >>> +1 to develop a strategy for representing how to represent and >>> handle missing data (see below) >>>>>>>> I think this additional option makes sense, but I forward this >>>>>>>> discussion to the dev mailing list where it is better suited. >>>>>>> I'm wary of having CM handle "missing" data. >>>>>>> For one thing we'd have to define a "convention" to represent >> missing data. >>>>>>> There is no good way to do that in Java. Using NaN for this purpose >> in a >>>>>>> low-level library is not a good idea IMHO. >>>>>>> >>>>>> I agree with Gilles, here. If I remember correctly, R has a special >>>>>> value NA, or something similar, which differs from NaN. >>>>>>> Then, any convention might not be >>>>>>> suitable for some user applications, which would lead such an >> application's >>>>>>> developer to filter the data anyway in order to change his >> representation to >>>>>>> CM's representation. Rather that calling two redundant filtering >> codes, I'd >>>>>>> rather assume that CM gets a clean input on which its algorithm can >> operate. >>>>>>> As usual, the input is subjected to precondition checks, and >> exceptions are >>>>>>> thrown if the data is not clean enough. >>>>>>> >>>>>>> In summary: data validation (in the sense of discarding input) >> should not be >>>>>>> done _before_ calling CM routines. >>>>>>> >>>>>> +1. >>>>> ok, I am now confused. First you say that CM should not be involved in >>>>> data cleaning, but then you state that data validation should not be >>>>> done before calling CM? May be there is a *not* too much? >>>> Yes, you are right: I wrote the opposite of what I meant. >>>> --- >>>> In summary: data validation (in the sense of discarding input) should >>>> be done _before_ calling CM routines. >>>> --- >>>> >>>>> I think the proposition from Patrick was to exactly do that: throw an >>>>> exception if such invalid data is encountered (NaNStrategy.FAIL). >>>>> >>>>> The other thing is, that the NaNStrategy.REMOVED is broken, so either >> we >>>>> fix is or deprecate it. >>> That we should fix. Please open a JIRA for this. I assume you are >>> talking about the implementation in NaturalRanking. >>>> +1 >>>> [I mean (I think): If people rely on CM's removal of NaNs, we could fix >> it. >>>> However, if nobody could actually rely on this feature because it is >> broken, >>>> I'd prefer to remove it.] >>> There are two issues here. One is specific to ranking algorithms. >>> To be well-defined, a RankingAlgorithm needs a NaNStrategy, since >>> the result has to be a total ordering. The NaNStrategy.REMOVED >>> strategy is intended to represent removal of NaNs from the data to >>> be ordered. If it is not implemented correctly in NaturalRanking or >>> other rankings that is a bug and needs to be fixed. >> Sorry, I just reread Patrick's original mail. IIUC, there is >> nothing wrong with the implementation of NaNStrategy.REMOVED in >> NaturalRanking or other implemented rankings. The problem is how >> the Spearman's impl handles it. That is indeed a bug in Spearman's >> impl that should be fixed. The correct fix is to throw out the >> corresponding entry in the second array when REMOVED is the >> configured NaNStrategy. I agree with Patrick that adding .FAIL and >> setting that as the default is a good idea. Patches welcome. >>> The second issue is the more general one of how to represent and >>> handle missing data. I have always seen that as a limitation that >>> we would eventually address on an algorithm by algorithm basis. >>> Different algorithms can be configured to do different things when >>> missing data are encountered. It is not always possible or >>> desirable to preprocess the data to "eliminate" or impute missing >>> data. Saying that we are just not going to deal with it is a >>> limitation that I don't think we should impose. I am would like to >>> hear others' ideas about good ways to model missing data in Java. > Hi Phil, > > ok I have created three new issues: > > * MATH-891 > * MATH-892 > * MATH-893
Thanks! > > Regarding the NaNStrategy.REMOVED, I think it will be necessary to adjust > the RankingAlgorithm interface a bit. Right now, it only takes as input a > one-dimensional array. But in case of correlations, you have two input > arrays. If you remove from one array the NaN values, you have no means to > know at which index they have been removed to do the same with the other > array. Or you push that responsibility to the client - in this case SpearmansCorrelation. My first thought on how to fix the Spearman's impl was to have it compare lengths of ranked / unranked when invoked with the REMOVED NaN strategy and then scan the original arrays when removals happen, adjusting the ranked arrays accordingly. Phil > > Thomas > --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org For additional commands, e-mail: dev-h...@commons.apache.org