At Ted's suggestion I looked at LSMR in Mahout. In general, I have no
complaints about the algorithm or how it is coded up. I have seen the
algorithm in the PrimalDual solver that Michael Saunders et al. cooked up. I
believe the solver is part of the COIN project. I have nothing but praise
for it.
However, what I contacted Phil about was setting up some interfaces to
define a general contract so that we could code up different ways of
performing OLS. To wit, here is what I had in mind:
public interface UpdatingLinearRegression {
    public long getNobs();
    public void addData(double[] x, double y);
    public void addData(double[][] x, double[] y);
    public void clear();
    public RegressionResults regress() throws MathException;
    public RegressionResults regress(int[] variablesToInclude) throws MathException;
}
The other interface is:
public interface RegressionResults {
    public double getParameterEstimate(int index) throws IndexOutOfBoundsException;
    public double[] getParameterEstimates();
    public double getStdErrorOfEstimate(int index) throws IndexOutOfBoundsException;
    public double[] getStdErrorOfEstimates();
    public boolean isRedundant(int index) throws IndexOutOfBoundsException, MathException;
    public boolean[] getRedundant();
    public int getNumberOfParameters();
    public long getNobs();
    public double getTotalSumSquares();
    public double getRegressionSumSquares();
    public double getErrorSumSquares();
    public double getMeanSquareError();
    public double getRSquared();
}
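To make the relationships among the summary accessors concrete, here is a
minimal sketch of the textbook identities I would expect an implementation to
satisfy. It assumes the model includes an intercept and that the parameter
count includes it. The class name SumOfSquaresSketch and its methods are
hypothetical, for illustration only, and are not part of the proposal.

```java
// Hypothetical helper, not part of the proposed interfaces.
// Assumes a model with an intercept, so that SST = SSR + SSE holds.
final class SumOfSquaresSketch {

    /** SSR = SST - SSE. */
    static double regressionSumSquares(double totalSS, double errorSS) {
        return totalSS - errorSS;
    }

    /** R^2 = 1 - SSE / SST. */
    static double rSquared(double totalSS, double errorSS) {
        return 1.0 - errorSS / totalSS;
    }

    /** MSE = SSE / (n - p), where p counts every estimated parameter. */
    static double meanSquareError(double errorSS, long nobs, int numberOfParameters) {
        return errorSS / (nobs - numberOfParameters);
    }
}
```

Whether these values are computed eagerly or on demand would be left to each
implementation.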
Borrowing liberally from the SimpleRegression class, the above functionality
describes most of what a user would expect from a classical regression
analysis. What the interface buys us is the ability to support the many ways
to generate the results above: QR factorizations, in-place Gaussian
elimination, incremental SVD and so forth.
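As a rough illustration of how one backend could sit behind such a contract,
here is a minimal sketch that accumulates X'X and X'y one observation at a
time and solves the normal equations by Gaussian elimination with partial
pivoting. The names UpdatingRegressionSketch and NormalEquationsRegression
are hypothetical. The interface is a trimmed stand-in for the one above (no
MathException, no variable subsetting, coefficients returned as a bare array)
so that the example compiles on its own. I would not suggest normal equations
as the default on numerical grounds; they are just the shortest thing to show.

```java
import java.util.Arrays;

// Trimmed, hypothetical stand-in for the proposed contract.
interface UpdatingRegressionSketch {
    long getNobs();
    void addData(double[] x, double y);
    void clear();
    double[] regress(); // coefficients, intercept first
}

// Accumulates X'X and X'y incrementally, then solves (X'X) beta = X'y.
final class NormalEquationsRegression implements UpdatingRegressionSketch {
    private final int p;          // parameters, including the intercept
    private final double[][] xtx; // running X'X
    private final double[] xty;   // running X'y
    private long nobs;

    NormalEquationsRegression(int numRegressors) {
        p = numRegressors + 1;
        xtx = new double[p][p];
        xty = new double[p];
    }

    public long getNobs() { return nobs; }

    public void addData(double[] x, double y) {
        double[] row = new double[p];
        row[0] = 1.0;                           // intercept column
        System.arraycopy(x, 0, row, 1, p - 1);
        for (int i = 0; i < p; i++) {
            xty[i] += row[i] * y;
            for (int j = 0; j < p; j++) xtx[i][j] += row[i] * row[j];
        }
        nobs++;
    }

    public void clear() {
        for (double[] r : xtx) Arrays.fill(r, 0.0);
        Arrays.fill(xty, 0.0);
        nobs = 0;
    }

    public double[] regress() {
        // Work on copies so that more data can be added afterwards.
        double[][] a = new double[p][];
        for (int i = 0; i < p; i++) a[i] = xtx[i].clone();
        double[] b = xty.clone();
        for (int k = 0; k < p; k++) {
            int piv = k;                        // partial pivoting
            for (int i = k + 1; i < p; i++)
                if (Math.abs(a[i][k]) > Math.abs(a[piv][k])) piv = i;
            // Crude absolute tolerance, chosen only for this sketch.
            if (Math.abs(a[piv][k]) < 1e-12)
                throw new ArithmeticException("singular or redundant design");
            double[] tr = a[k]; a[k] = a[piv]; a[piv] = tr;
            double tb = b[k]; b[k] = b[piv]; b[piv] = tb;
            for (int i = k + 1; i < p; i++) {
                double f = a[i][k] / a[k][k];
                b[i] -= f * b[k];
                for (int j = k; j < p; j++) a[i][j] -= f * a[k][j];
            }
        }
        double[] beta = new double[p];          // back substitution
        for (int i = p - 1; i >= 0; i--) {
            double s = b[i];
            for (int j = i + 1; j < p; j++) s -= a[i][j] * beta[j];
            beta[i] = s / a[i][i];
        }
        return beta;
    }
}
```

A QR-based or incremental-SVD class could implement the same interface and be
swapped in without the calling code changing, which is the point of having
the contract in the first place.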
Thoughts?
-Greg