Hi.

> > [...]
> >>
> >> Now is as good a time as any to think about how to correctly
> >> represent and handle missing data. The unfortunate thing is that in
> >> Java working with primitive doubles we are back to the old Fortran
> >> days of having no natural representation of a missing value.
> >> Sticking with primitives, the only thing we can do is either use NaN
> >> or allow the "missing" designator to be configured by the user. I am
> >> curious what others have done in this area.
> > As you say, as I said, with primitive double, there is no value that
> > can readily serve as "missing". It's a user's choice (e.g.
> > "Double.NaN", "Double.MAX_VALUE", "-Double.MAX_VALUE", "any negative
> > value", ...), that depends on the context.
> >
> >> The second question is what strategies do we support for handling
> >> missing data and how do we represent those strategies. The
> >> simplest and easiest strategy to implement is to delete observations
> >> that include missing data. This is a data-only strategy and would
> >> work the same way across algorithms. I am afraid, however, that this
> >> is the only strategy that is not algorithm-dependent (unless you
> >> consider, e.g. EM as a missing data strategy or very simple
> >> imputation strategies). So that means individual algorithms need to
> >> include missing data strategies in their specifications. It might be
> >> good to define and implement these for the correlation and regression
> >> classes and see if we can generalize. Any ideas on how best to do
> >> this?
> > I'm sorry if I'm dense, but I don't remember if or why the option that
> > users should provide clean input data to CM has been ruled out.
> > I.e. filtering (by user) is done before computation (by CM's algo).
> >
> > If the data is missing, how can you use it (to correlate, to fit, ...)?
>
> There are multiple techniques that can be used to adjust for missing data,
> depending on the algorithm. See [1], for example, for a summary of the
> kinds of techniques that can be used in regression.
> Basically, saying users need to adjust the data before providing it to the
> algorithm allows only the "data only" approaches and may be inconvenient or
> make impossible other analyses to be performed on the same data.
>
> Phil
>
> [1]
> http://www.uvm.edu/~dhowell/StatPages/More_Stuff/Missing_Data/Missing.html
>
>
> I agree that we should consider a more comprehensive treatment of missing
> data. Perhaps we should start by designing an interface that could be
> implemented by existing classes. For example, an imputation interface could
> have methods like miimpute, mianalyze and misummarize and this interface
> could be implemented in a class that extends OLSMultipleLinearRegression.
> This approach allows each estimation method to adopt its own treatment of
> missing data.
>
> An alternative is to develop data structures that represent the original and
> complete data sets. Missing data methods could be applied to the data
> structures and return a complete data set for use in estimation methods.
>
> I guess the decision is whether the missing data treatment should be part of
> an independent data structure or part integrated into estimation method.
> Just some thoughts about possible ways of handling it.
>
> Patrick
>

Is the issue (in CM) about handling missing data or representing
missing data?

IIUC, handling is algorithm-dependent.
Representation is a matter of convention (i.e. user-dependent).

My proposal would be that for every algorithm that is able to handle
missing data, we provide an argument (to constructors) that specifies
the "double" value that represents a missing value.

Regards,
Gilles
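
For illustration only, here is a minimal Java sketch of what a constructor-supplied "missing" marker might look like, combined with the simple "delete observations that include missing data" strategy discussed in this thread. The class name MissingValueFilter and its methods are hypothetical; nothing like this is claimed to exist in Commons Math. The one real subtlety it shows is that the marker cannot be compared with "==" if the user picks Double.NaN, so the sketch compares bit patterns instead.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper (not part of Commons Math): the caller chooses the
 * double that marks a missing value, and incomplete observations are dropped.
 */
final class MissingValueFilter {
    /** Bit pattern of the user-chosen "missing" marker. */
    private final long missingBits;

    /**
     * @param missingValue the double that represents "missing",
     *                     e.g. Double.NaN or -Double.MAX_VALUE
     */
    MissingValueFilter(double missingValue) {
        // doubleToLongBits maps every NaN to one canonical bit pattern, so
        // NaN works as a marker even though (NaN == NaN) is false.
        // Side effect of comparing bits: 0.0 and -0.0 are treated as different.
        this.missingBits = Double.doubleToLongBits(missingValue);
    }

    /** Returns true if x is the configured "missing" marker. */
    boolean isMissing(double x) {
        return Double.doubleToLongBits(x) == missingBits;
    }

    /**
     * Data-only strategy: returns copies of the rows (observations) that
     * contain no missing value. The input array is not modified.
     */
    double[][] deleteIncompleteRows(double[][] data) {
        List<double[]> complete = new ArrayList<>();
        for (double[] row : data) {
            boolean hasMissing = false;
            for (double x : row) {
                if (isMissing(x)) {
                    hasMissing = true;
                    break;
                }
            }
            if (!hasMissing) {
                complete.add(row.clone());
            }
        }
        return complete.toArray(new double[0][]);
    }
}
```

An algorithm that supports missing data could accept the same marker in its own constructor and apply an algorithm-specific strategy instead of deletion; the filter above only covers the algorithm-independent case.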