Mauro Talevi wrote:
Hi Phil,

thanks for reviewing the multiple linear regression implementations and setting up the R/NIST data tests. I finally got around to installing R and can now run them too.

Phil Steitz wrote:
While clear and elegant from a matrix algebra standpoint, the "naive" implementation in OLSMultipleLinearRegression has bad numerical qualities. It is well known that solving the normal equations directly does not give good numerics. I just added some tests to verify actual parameter values, using the classic "Longley" dataset, for which NIST provides certified statistics. This is a "hard" design matrix: R gets within 1E-8 of the certified parameter values, while OLSMultipleLinearRegression only gets within 1E-1.
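For anyone who wants the one-line explanation: the normal-equations approach computes

    \hat{\beta} = (X^T X)^{-1} X^T y

and forming X^T X squares the condition number of the problem, \kappa(X^T X) = \kappa(X)^2, so roughly half of the available significant digits are at risk before the solve even begins. The Longley design matrix is ill-conditioned enough that this shows up directly in the parameter estimates.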

The OLS implementation was added as a simple by-product of the GLS case - which is the main one I have needed for hypothesis testing - since it comes "for free" when the covariance matrix is the identity. True - the emphasis was on clarity and formulaic simplicity, following the old Donald Knuth maxim that "premature optimization is the root of all evil". But it seems the implementation needs refinement - the devil has reared his head :-)
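(For the archives, the textbook GLS estimator is

    \hat{\beta}_{GLS} = (X^T \Omega^{-1} X)^{-1} X^T \Omega^{-1} y

where \Omega is the error covariance matrix; with \Omega = I it collapses to the OLS normal-equations formula above, which is exactly the expression Phil is flagging as numerically fragile.)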

Yes, and I would distinguish performance optimization from numerical accuracy. From my perspective, we can release a ".0" with room for performance improvement, but at least decent numerics are required.
We have talked in the past about providing an implementation based on QR decomposition. Anyone up for using the QR decomposition that we now have to do this? I really think we need to do it (or something else to improve numerics) before releasing this class. I will get to it eventually, but am a little pegged at the moment. I will review and apply patches if someone is willing to do the implementation. I can also explain here or offline how the R tests and NIST datasets work, as these are useful in validating code.
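To make the suggestion concrete, here is a minimal sketch of what a QR-based solve looks like with plain arrays - modified Gram-Schmidt followed by back-substitution, assuming X has full column rank. This is not commons-math API, just the shape of the computation:

    // Sketch only: least-squares fit of y on X via thin QR factorization
    // (modified Gram-Schmidt), assuming X has full column rank.
    public static double[] solveOls(double[][] x, double[] y) {
        final int n = x.length;     // observations
        final int m = x[0].length;  // regressors
        double[][] q = new double[n][m];
        double[][] r = new double[m][m];
        for (int i = 0; i < n; i++) {
            System.arraycopy(x[i], 0, q[i], 0, m);
        }
        for (int k = 0; k < m; k++) {
            double norm = 0.0;
            for (int i = 0; i < n; i++) {
                norm += q[i][k] * q[i][k];
            }
            r[k][k] = Math.sqrt(norm);
            for (int i = 0; i < n; i++) {
                q[i][k] /= r[k][k];
            }
            for (int j = k + 1; j < m; j++) {
                double dot = 0.0;
                for (int i = 0; i < n; i++) {
                    dot += q[i][k] * q[i][j];
                }
                r[k][j] = dot;
                for (int i = 0; i < n; i++) {
                    q[i][j] -= dot * q[i][k];
                }
            }
        }
        // Solve R b = Q^T y by back-substitution.
        double[] qty = new double[m];
        for (int k = 0; k < m; k++) {
            for (int i = 0; i < n; i++) {
                qty[k] += q[i][k] * y[i];
            }
        }
        double[] b = new double[m];
        for (int k = m - 1; k >= 0; k--) {
            double s = qty[k];
            for (int j = k + 1; j < m; j++) {
                s -= r[k][j] * b[j];
            }
            b[k] = s / r[k][k];
        }
        return b;
    }

The point is that R \hat{\beta} = Q^T y is solved directly, so X^T X is never formed.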

I'd be happy to improve the impl. I'm getting my head around R and NIST, but perhaps a chat offline would not hurt!
I may be hard to catch synchronously, as my day-and-night job is a little demanding, but I would be happy to answer questions (with maybe a little latency ;)

Another thing that we should think about before releasing any of this stuff is the completeness of the API. Many standard regression statistics are missing. If we are going to stick with the Interface / Implementation setup, we need to get the right stuff into the interface. It is also awkward to have to insert "1"'s in the design matrix to get an intercept term computed. This is convenient for implementation, but awkward for users. A more natural setup (IMHO) would be to expose a "noIntercept" or "hasIntercept" property for the model.

No problem with adding other statistics - let's just decide on what the standard regression API is.
Here are some initial ideas on what should be included in the multiple regression API. Other suggestions welcome!

1. Coefficients should be accompanied by standard errors, t-statistics, two-sided t probabilities (can get these using the t distribution from the distributions package) and ideally confidence intervals (the textbook formulas are sketched after this list).
2. F, R-square, adjusted R-square, F prob (again, can use the distributions package to estimate).
3. ANOVA table (regression sum of squares, residual sum of squares).
4. Residuals.
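For concreteness, the standard formulas behind items 1 and 2, with SSR the residual sum of squares, SST the total sum of squares, n observations and m estimated parameters (intercept included):

    \hat{\sigma}^2 = SSR / (n - m)
    se(\hat{\beta}_j) = \sqrt{ \hat{\sigma}^2 [ (X^T X)^{-1} ]_{jj} }
    t_j = \hat{\beta}_j / se(\hat{\beta}_j)
    R^2 = 1 - SSR / SST
    adjusted R^2 = 1 - (1 - R^2) (n - 1) / (n - m)
    F = [ (SST - SSR) / (m - 1) ] / [ SSR / (n - m) ]

A nice side effect of a QR-based solve is that X^T X = R^T R, so (X^T X)^{-1} = R^{-1} R^{-T} and the standard errors fall out of the factorization without ever forming X^T X.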

R, SAS, SPSS and Excel all represent (or in the case of R, can construct) these basic statistics in some way in their output. We should model them in classes representing properties of the computed model.

And finally, how do you see the no/hasIntercept model working?
As a configurable property - noIntercept means the model is estimated without an intercept. The point I was making was more about how the data is supplied via the API. It is awkward to have to fill in a column of 1's just to get the linear algebra to work when estimating a model with an intercept (which should be the default). The implementation could absorb that detail, as sketched below.
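Something along these lines inside the implementation (field and method names illustrative only):

    // Illustrative only: augment the user-supplied matrix with a leading
    // column of 1's when an intercept is requested, so callers never have to.
    private boolean hasIntercept = true;  // intercept by default
    private double[][] x;

    public void setData(double[][] data) {  // data: n rows of regressor values
        int n = data.length;
        int k = data[0].length;
        if (!hasIntercept) {
            x = data;
            return;
        }
        x = new double[n][k + 1];
        for (int i = 0; i < n; i++) {
            x[i][0] = 1.0;                            // intercept column
            System.arraycopy(data[i], 0, x[i], 1, k); // user regressors
        }
    }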

I would recommend that we have setData or "newData" take an n x m matrix, where n is the number of observations and m - 1 is the number of independent variables. Then either:
a) have the constructor take another argument specifying which column holds the dependent variable;
b) assume it is the first column;
c) support column labels and some form of model specification such as what R provides (a lot of work);
d) split off the y vector, so setting data requires separate x and y vectors.
Probably a) is easiest for users, who will most often be starting with a rectangular array of data with the dependent variable in one of the columns (see the sketch below).
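A sketch of what option a) could look like (method name and signature purely illustrative):

    // Hypothetical API sketch for option a): caller passes one rectangular
    // data matrix plus the index of the dependent-variable column.
    public void newSampleData(double[][] data, int yColumn) {
        int n = data.length;
        int m = data[0].length;
        double[] y = new double[n];
        double[][] x = new double[n][m - 1];
        for (int i = 0; i < n; i++) {
            int xj = 0;
            for (int j = 0; j < m; j++) {
                if (j == yColumn) {
                    y[i] = data[i][j];       // dependent variable
                } else {
                    x[i][xj++] = data[i][j]; // independent variables
                }
            }
        }
        // ... hand x and y to the estimation routine ...
    }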

Phil

