Mauro Talevi wrote:
Phil Steitz wrote:
Yes, and I would distinguish performance optimization from numerical
accuracy. From my perspective, we can release a ".0" with room for
performance improvement, but at least decent numerics are required.
I agree that decent numerics are required. I'm still rather
surprised that the diagonal covariance case would yield such bad
numerics relative to the GLS case, which has been tested against
independent Fortran code to a level of 10^-6.
I have only tested the OLS implementation. To perform similar tests
against R for the GLS impl, we need to look at the R "gls" function.
See the link below for some comments on why we need to be careful with
validation tests.
We have talked in the past about providing an implementation based
on QR decomposition. Anyone up for using the QR decomposition
that we now have to do this? I really think we need to do it (or
something else to improve numerics) before releasing this class. I
will get to it eventually, but am a little pegged at the moment.
Are you proposing doing a QR decomposition of both the X and Y
matrices and working out the formulas using the decomposed ones?
No, just X. See the references here:
http://apache.markmail.org/message/3aybm5emimg5da42
I think R uses QR as described above. Comments or suggestions for other
default implementations are most welcome. We should aim to provide a
default implementation that is reasonably fast and provides good
numerics across a broad range of design matrices.
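To make this concrete, here is a rough sketch of what a QR-based
estimator could look like on top of the linear package (untested, and
the class and method names are from my reading of what is in trunk, so
treat them as approximate rather than working code):

    import org.apache.commons.math.linear.*;

    // Solve min ||y - X*b|| via X = QR, then R*b = Q^T * y.
    // The decomposition solver does the back-substitution, so we
    // never form X^T * X, which is where the normal-equations
    // approach loses precision on ill-conditioned design matrices.
    public class QRRegressionSketch {
        public static double[] estimateBeta(double[][] x, double[] y) {
            RealMatrix design = MatrixUtils.createRealMatrix(x);
            DecompositionSolver solver =
                new QRDecompositionImpl(design).getSolver();
            return solver.solve(y); // least-squares solution for beta
        }
    }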
Here are some initial ideas on what should be included in the
multiple regression API. Other suggestions welcome!
1. Coefficients should be accompanied by standard errors,
t-statistics, two-sided t probabilities (we can get these using the t
distribution from the distributions package - see the sketch below)
and ideally confidence intervals.
2. F, R-square, adjusted R-square, F prob (again can use
distributions package to estimate)
3. ANOVA table (Regression sum of squares, residual sum of squares)
4. Residuals
R, SAS, SPSS and Excel all represent (or in the case of R, can
construct) these basic statistics in some way in their output. We
should model them in classes representing properties of the computed
model.
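For item 1, the two-sided t probabilities drop straight out of the
distributions package. Something like this (untested sketch - the
standard error and the residual degrees of freedom would come from
the fitted model):

    import org.apache.commons.math.MathException;
    import org.apache.commons.math.distribution.TDistributionImpl;

    // Two-sided p-value for a single coefficient:
    // p = 2 * P(T > |t|), with t = betaHat / stdErr and
    // T distributed with (n - m) degrees of freedom.
    public class CoefficientStats {
        public static double twoSidedTProbability(double betaHat,
                double stdErr, int residualDf) throws MathException {
            double t = Math.abs(betaHat / stdErr);
            TDistributionImpl tDist = new TDistributionImpl(residualDf);
            return 2.0 * (1.0 - tDist.cumulativeProbability(t));
        }
    }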
Perhaps we should put these on the wiki or even better in jira.
IMO, it's best to deal with the numerics and the new data input
strategies before bringing new functionality into the frame.
We do need to decide what the API is, so even if it takes a while to
implement things, or the initial implementations are naive, we should
decide what statistics we are going to provide and how we are going to
provide them. The same goes for the specification of models (i.e.,
"input data").
And finally, how do you see the no/hasIntercept model working?
As a configurable property - noIntercept means the model is estimated
without an intercept. The point I was making was more about how the
data is supplied via the API. It is awkward to have to fill in a
column of 1's to get the linear algebra to work when estimating a
model with an intercept (which should be the default).
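In other words, the implementation should hide that step. Something
like this hypothetical helper (the name is made up, just to
illustrate):

    // If the model has an intercept (the default), the implementation -
    // not the user - augments the design matrix with the leading
    // column of 1's that the linear algebra needs.
    public class DesignMatrixBuilder {
        public static double[][] withIntercept(double[][] x,
                boolean noIntercept) {
            if (noIntercept) {
                return x; // estimate through the origin, use x as given
            }
            double[][] design = new double[x.length][];
            for (int i = 0; i < x.length; i++) {
                design[i] = new double[x[i].length + 1];
                design[i][0] = 1.0; // intercept column
                System.arraycopy(x[i], 0, design[i], 1, x[i].length);
            }
            return design;
        }
    }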
ok - good point.
I would recommend that we have setData or "newData" provide an n x m
matrix, where n is the number of observations and m-1 is the number
of independent variables. Then either:
a) have the constructor take another argument specifying which column
holds the dependent variable
b) assume it is the first column
c) support column labels and some form of model specification such as
what R provides (a lot of work)
d) split off the y vector, so setting data requires separate x and y
vectors.
Probably a) is easiest for users, who will most often be starting
with a rectangular array of data with the dependent variable in one
of the columns.
Perhaps it would help if we had overloaded newData methods that accept
different input strategies, but ultimately they all produce an n x m
double array. That way we can provide users with choice.
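Something along these lines (the method and type names are invented,
just to show the shape of it):

    // Hypothetical overloads - each variant normalizes its input
    // to the same internal n x m double array.
    public interface RegressionData {

        /** Option a): rectangular data plus the index of the y column. */
        void newData(double[][] data, int yColumn);

        /** Convention b): dependent variable assumed to be in column 0. */
        void newData(double[][] data);

        /** Option d): caller supplies y and x separately. */
        void newData(double[] y, double[][] x);
    }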
I was thinking the same thing. The bit that is troubling me is the
omega matrix required by GLS cluttering the OLS interface. Other types
of models (e.g. weighted) will require other data. It could be that we
need separate interfaces for the different types of regression, but
maybe it is better to dispense with the abstract interface altogether.
The reason we have interface/implementation separation is to allow
alternative implementations to be plugged in. Given the 2.0 approach
to supporting IoC, what may make more sense is to just encapsulate the
core model estimators (things like R's lm and gls), make them
pluggable via setters or constructors, and get rid of the abstract
interface. Any thoughts on this?
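To illustrate what I mean by encapsulating the estimators (all names
here are invented - just a sketch of the setter-injection idea):

    // The estimation strategy is the pluggable piece; the enclosing
    // class owns data handling and the derived statistics. A GLS
    // estimator would take its omega matrix through its own
    // constructor, keeping it out of this shared signature.
    public class MultipleLinearRegression {

        public interface Estimator {
            double[] estimateParameters(double[][] x, double[] y);
        }

        private Estimator estimator; // e.g. QR-based OLS, GLS, weighted

        public void setEstimator(Estimator estimator) { // IoC-style injection
            this.estimator = estimator;
        }

        public double[] estimate(double[][] x, double[] y) {
            return estimator.estimateParameters(x, y);
        }
    }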
Phil