Mauro Talevi wrote:
Phil Steitz wrote:
Yes, and I would distinguish performance optimization from numerical
accuracy. From my perspective, we can release a ".0" with room for
performance improvement, but at least decent numerics are required.
I agree that decent numerics are required. I'm still rather
surprised that the diagonal covariance case would yield such bad
numerics relative to the GLS case, which has been tested against
independent Fortran code to a level of 10^-6.
I have only tested the OLS implementation. To perform similar tests
against R for the GLS impl, we need to look at the R "gls" function.
See the link below for some comments on why we need to be careful with
validation tests.
We have talked in the past about providing an implementation based
on QR decomposition. Anyone up for using the QR decomposition
that we now have to do this? I really think we need to do it (or
something else to improve numerics) before releasing this class. I
will get to it eventually, but am a little pegged at the moment.
Are you proposing doing a QR decomposition of both the X and Y
matrices and working out the formulas using the decomposed ones?
No, just X. See the references here:
http://apache.markmail.org/message/3aybm5emimg5da42
I think R uses QR as described above. Comments or suggestions for other
default implementations are most welcome. We should aim to provide a
default implementation that is reasonably fast and provides good
numerics across a broad range of design matrices.
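To make this concrete, here is a rough sketch of what a QR-based
estimator could look like on top of the linear package (untested, and
the class and method names are from my reading of what is in trunk, so
treat them as approximate rather than working code):

    import org.apache.commons.math.linear.*;

    // Solve min ||y - X*b|| via X = QR, then R*b = Q^T * y.
    // The decomposition solver does the back-substitution, so we
    // never form X^T * X, which is where the normal-equations
    // approach loses precision on ill-conditioned design matrices.
    public class QRRegressionSketch {
        public static double[] estimateBeta(double[][] x, double[] y) {
            RealMatrix design = MatrixUtils.createRealMatrix(x);
            DecompositionSolver solver =
                new QRDecompositionImpl(design).getSolver();
            return solver.solve(y); // least-squares solution for beta
        }
    }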
Here are some initial ideas on what should be included in the
multiple regression API. Other suggestions welcome!
1. Coefficients should be accompanied by standard errors,
t-statistics, two-sided t probabilities (we can get these using the t
distribution from the distributions package - see the sketch below)
and ideally confidence intervals.
2. F, R-square, adjusted R-square, F prob (again can use
distributions package to estimate)
3. ANOVA table (Regression sum of squares, residual sum of squares)
4. Residuals
R, SAS, SPSS and Excel all represent (or in the case of R, can
construct) these basic statistics in some way in their output. We
should model them in classes representing properties of the computed
model.
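For item 1, the two-sided t probabilities drop straight out of the
distributions package. Something like this (untested sketch - the
standard error and the residual degrees of freedom would come from
the fitted model):

    import org.apache.commons.math.MathException;
    import org.apache.commons.math.distribution.TDistributionImpl;

    // Two-sided p-value for a single coefficient:
    // p = 2 * P(T > |t|), with t = betaHat / stdErr and
    // T distributed with (n - m) degrees of freedom.
    public class CoefficientStats {
        public static double twoSidedTProbability(double betaHat,
                double stdErr, int residualDf) throws MathException {
            double t = Math.abs(betaHat / stdErr);
            TDistributionImpl tDist = new TDistributionImpl(residualDf);
            return 2.0 * (1.0 - tDist.cumulativeProbability(t));
        }
    }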
Perhaps we should put these on the wiki or even better in jira.
IMO, it's best to deal with the numerics and the new data input
strategies before bringing new functionality into the frame.
We do need to decide what the API is, so even if it takes a while to
implement things, or the initial implementations are naive, we should
decide what statistics we are going to provide and how we are going to
provide them. The same goes for the specification of models (i.e.,
"input data").
And finally, how do you see the no/hasIntercept model working?
As a configurable property - noIntercept means the model is estimated
without an intercept. The point I was making was more about how the
data is supplied via the API. It is awkward to have to fill in a
column of 1's to get the linear algebra to work when estimating a
model with an intercept (which should be the default).
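In other words, the implementation should hide that step. Something
like this hypothetical helper (the name is made up, just to
illustrate):

    // If the model has an intercept (the default), the implementation -
    // not the user - augments the design matrix with the leading
    // column of 1's that the linear algebra needs.
    public class DesignMatrixBuilder {
        public static double[][] withIntercept(double[][] x,
                boolean noIntercept) {
            if (noIntercept) {
                return x; // estimate through the origin, use x as given
            }
            double[][] design = new double[x.length][];
            for (int i = 0; i < x.length; i++) {
                design[i] = new double[x[i].length + 1];
                design[i][0] = 1.0; // intercept column
                System.arraycopy(x[i], 0, design[i], 1, x[i].length);
            }
            return design;
        }
    }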
ok - good point.
I would recommend that we have setData or "newData" provide an n x m
matrix, where n is the number of observations and m-1 is the number
of independent variables. Then either:
a) have the constructor take another argument specifying which column
holds the dependent variable
b) assume it is the first column
c) support column labels and some form of model specification such as
what R provides (a lot of work)
d) split off the y vector, so setting data requires separate x and y
vectors.
Probably a) is easiest for users, who will most often be starting
with a rectangular array of data with the dependent variable in one
of the columns.
Perhaps it would help if we had overloaded newData methods that accept
different input strategies, but ultimately they all produce an n x m
double array. That way we can provide users with choice.
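Something along these lines (the method and type names are invented,
just to show the shape of it):

    // Hypothetical overloads - each variant normalizes its input
    // to the same internal n x m double array.
    public interface RegressionData {

        /** Option a): rectangular data plus the index of the y column. */
        void newData(double[][] data, int yColumn);

        /** Convention b): dependent variable assumed to be in column 0. */
        void newData(double[][] data);

        /** Option d): caller supplies y and x separately. */
        void newData(double[] y, double[][] x);
    }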
I was thinking the same thing. The bit that is troubling me is the
omega matrix required by GLS cluttering the OLS interface. Other types
of models (e.g. weighted) will require other data. It could be that we
need separate interfaces for the different types of regression, but
maybe it is better to dispense with the abstract interface altogether.
The reason we have interface/implementation separation is to allow
alternative implementations to be plugged in. Given the 2.0 approach
to supporting IoC, what may make more sense is to just encapsulate the
core model estimators (things like R's lm and gls), make them
pluggable via setters or constructors, and get rid of the abstract
interface. Any thoughts on this?
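To illustrate what I mean by encapsulating the estimators (all names
here are invented - just a sketch of the setter-injection idea):

    // The estimation strategy is the pluggable piece; the enclosing
    // class owns data handling and the derived statistics. A GLS
    // estimator would take its omega matrix through its own
    // constructor, keeping it out of this shared signature.
    public class MultipleLinearRegression {

        public interface Estimator {
            double[] estimateParameters(double[][] x, double[] y);
        }

        private Estimator estimator; // e.g. QR-based OLS, GLS, weighted

        public void setEstimator(Estimator estimator) { // IoC-style injection
            this.estimator = estimator;
        }

        public double[] estimate(double[][] x, double[] y) {
            return estimator.estimateParameters(x, y);
        }
    }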
Phil