Re: [math] correlation analysis with NaNs

Gilles Sadowski Tue, 20 Nov 2012 05:28:54 -0800

Hi.

> > [...]
> >>
> >> Now is as good a time as any to think about how to correctly 
> >> represent and handle missing data.  The unfortunate thing is that in 
> >> Java working with primitive doubles we are back to the old Fortran 
> >> days of having no natural representation of a missing value.
> >> Sticking with primitives, the only thing we can do is either use NaN 
> >> or allow the "missing" designator to be configured by the user.  I am 
> >> curious what others have done in this area.
> > As you say, as I said, with primitive double, there is no value that 
> > can readily serve as "missing". It's a user's choice (e.g. 
> > "Double.NaN", "Double.MAX_VALUE", "-Double.MAX_VALUE", "any negative 
> > value", ...), that depends on the context.
> >
> >> The second question is what strategies do we support for handling
> >> missing data and how do we represent those strategies.   The
> >> simplest and easiest strategy to implement is to delete observations 
> >> that include missing data.  This is a data-only strategy and would 
> >> work the same way across algorithms.  I am afraid, however, that this 
> >> is the only strategy that is not algorithm-dependent (unless you 
> >> consider, e.g. EM as a missing data strategy or very simple 
> >> imputation strategies).  So that means individual algorithms need to 
> >> include missing data strategies in their specifications.  It might be 
> >> good to define and implement these for the correlation and regression 
> >> classes and see if we can generalize.  Any ideas on how best to do 
> >> this?
> > I'm sorry if I'm dense, but I don't remember if or why the option that 
> > users should provide clean input data to CM has been ruled out.
> > I.e. filtering (by user) is done before computation (by CM's algo).
> >
> > If the data is missing, how can you use it (to correlate, to fit, ...)?
> 
> There are multiple techniques that can be used to adjust for missing data,
> depending on the algorithm.  See [1], for example, for a summary of the
> kinds of techniques that can be used in regression. 
> Basically, requiring users to adjust the data before providing it to the
> algorithm allows only the "data only" approaches, and may make other
> analyses of the same data inconvenient or impossible.
> 
> Phil
> 
> [1]
> http://www.uvm.edu/~dhowell/StatPages/More_Stuff/Missing_Data/Missing.html
> 
> 
> I agree that we should consider a more comprehensive treatment of missing
> data. Perhaps we should start by designing an interface that could be
> implemented by existing classes. For example, an imputation interface could
> have methods like miimpute, mianalyze and misummarize and this interface
> could be implemented in a class that extends OLSMultipleLinearRegression.
> This approach allows each estimation method to adopt its own treatment of
> missing data. 
> 
> An alternative is to develop data structures that represent the original and
> complete data sets. Missing data methods could be applied to the data
> structures and return a complete data set for use in estimation methods.
> 
> I guess the decision is whether the missing data treatment should be part of
> an independent data structure or integrated into each estimation method.
> Just some thoughts about possible ways of handling it.
> 
> Patrick
> 

Is the issue (in CM) about handling missing data or representing missing
data?
IIUC, handling is algorithm-dependent. Representation is a matter of
convention (i.e. user-dependent).

My proposal would be that for every algorithm that is able to handle
missing data, we provide an argument (to constructors) that specifies the
"double" value that represents a missing value.
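
To make the proposal concrete, here is a minimal sketch of what such a
constructor argument could look like for a correlation computation. The class
name "MissingAwareCorrelation" and its methods are hypothetical (nothing like
this exists in CM today), and dropping incomplete pairs is only one possible
handling strategy, chosen here for brevity.

```java
/** Hypothetical sketch; not an existing Commons Math class. */
final class MissingAwareCorrelation {
    /** User-chosen value that marks a missing observation. */
    private final double missing;

    MissingAwareCorrelation(double missing) {
        this.missing = missing;
    }

    /**
     * Double.compare treats NaN as equal to NaN (unlike "=="),
     * so Double.NaN works as a marker. Note that it also
     * distinguishes 0.0 from -0.0.
     */
    private boolean isMissing(double v) {
        return Double.compare(v, missing) == 0;
    }

    /** Pearson correlation over the pairs in which neither value is missing. */
    double correlation(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        // First pass: count the complete pairs and accumulate their sums.
        int n = 0;
        double sumX = 0;
        double sumY = 0;
        for (int i = 0; i < x.length; i++) {
            if (isMissing(x[i]) || isMissing(y[i])) {
                continue;
            }
            n++;
            sumX += x[i];
            sumY += y[i];
        }
        if (n < 2) {
            throw new IllegalArgumentException("need at least two complete pairs");
        }
        final double meanX = sumX / n;
        final double meanY = sumY / n;
        // Second pass: accumulate centered cross-products and squares.
        double sxy = 0;
        double sxx = 0;
        double syy = 0;
        for (int i = 0; i < x.length; i++) {
            if (isMissing(x[i]) || isMissing(y[i])) {
                continue;
            }
            final double dx = x[i] - meanX;
            final double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }
}
```

A caller whose data uses NaN as the marker would write
"new MissingAwareCorrelation(Double.NaN)"; one whose data uses -999 would pass
that value instead. The algorithm itself never has to guess the convention.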


Regards,
Gilles
