On 6/24/11 11:44 AM, Greg Sterijevski wrote:
> Hello All,
>
> I have been a user of the math commons jar for a little over a year and am
> very impressed with it. I was wondering whether anyone is actively working
> on implementing functionality to do regressions on very very large data
> sets. The current implementation of the OLS routine is an in-core QR
> decomposition with substitution. While the solutions are typically accurate,
> the in-core nature limits the usefulness of these objects.
>
> Looking through the code, most of the implementation of an InputStream based
> regression routine would respect the contract implicit in the interface
> MultipleLinearRegression. However, large regression problems are important
> enough that there should be a way to:
>
> 1. Wrap a potentially large data source, perhaps as an InputStream of some
> sort.
> 2. Have a separate contract with methods like clear() ( to clear whatever
> intermediate calculations are stored), and regress() which generates
> immutable results that are not affected by further updates of the data.
>
> I would appreciate any thoughts or comments, as well suggestions about
> functionality already in math commons which might address some points I
> raised.
>
> Thank you,
>
> -Greg
>

Hi Greg,

Thanks for the feedback and suggestion. You are correct that the multiple regression classes use a QR decomposition of the design matrix, so they are not really suitable for very large datasets. I agree that this would make a good enhancement, and I would be willing to work with you on design and implementation.

The SimpleRegression class, which handles only bivariate regression, has what amounts to a streaming interface now, so for bivariate models, arbitrary-sized datasets can be accommodated with the current code. But multiple regression will require some more work.

If you are interested in working on this, please open a JIRA and start with a patch proposing the API enhancements above in a new class. I am not sure it makes sense to have the new class extend AbstractMultipleLinearRegression, since that class really is fixed-model oriented, and methods like estimateResiduals() would have to be dropped or replaced by methods returning streams. I would say start with a new class and do not feel constrained to conform to the matrix-oriented API in the current (multiple) regression classes. The API of SimpleRegression may actually be a better model to start with.

As we prepare for 3.0, we have the opportunity to improve / repair the 2.x API, so if you have comments or suggestions for improvement of the existing classes, those would also be most welcome.

Thanks!

Phil
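
To make the contract discussed in this thread concrete (feed observations one at a time, clear() the intermediate sums, regress() to get results that later updates do not touch), here is a minimal sketch. The class StreamingOls and its methods addObservation(), getN(), clear() and regress() are hypothetical names made up for illustration; none of them exist in Commons Math. The sketch also swaps in a different technique from the current classes: it accumulates the normal equations X'X and X'y instead of doing a QR decomposition, which keeps memory at O(p^2) regardless of the number of rows but is numerically less stable when regressors are nearly collinear.

```java
import java.util.Arrays;

/** Hypothetical streaming OLS accumulator (illustration only, not Commons Math API). */
final class StreamingOls {
    private final int p;           // number of regressors, excluding the intercept
    private final double[][] xtx;  // running X'X, (p+1) x (p+1)
    private final double[] xty;    // running X'y, length p+1
    private long n;                // observations folded in so far

    StreamingOls(int numRegressors) {
        this.p = numRegressors;
        this.xtx = new double[p + 1][p + 1];
        this.xty = new double[p + 1];
    }

    /** Folds one observation into the running sums; the row itself is not stored. */
    void addObservation(double[] x, double y) {
        if (x.length != p) {
            throw new IllegalArgumentException("expected " + p + " regressors");
        }
        double[] row = new double[p + 1];
        row[0] = 1.0; // intercept column
        System.arraycopy(x, 0, row, 1, p);
        for (int i = 0; i <= p; i++) {
            xty[i] += row[i] * y;
            for (int j = 0; j <= p; j++) {
                xtx[i][j] += row[i] * row[j];
            }
        }
        n++;
    }

    long getN() {
        return n;
    }

    /** Discards all intermediate sums. */
    void clear() {
        for (double[] r : xtx) {
            Arrays.fill(r, 0.0);
        }
        Arrays.fill(xty, 0.0);
        n = 0;
    }

    /**
     * Solves (X'X) b = X'y by Gaussian elimination with partial pivoting,
     * working on copies so the accumulator can keep receiving data.
     * Returns a fresh array [intercept, b1, ..., bp].
     */
    double[] regress() {
        int m = p + 1;
        if (n < m) {
            throw new IllegalStateException("need at least " + m + " observations");
        }
        double[][] a = new double[m][];
        for (int i = 0; i < m; i++) {
            a[i] = xtx[i].clone();
        }
        double[] b = xty.clone();
        for (int col = 0; col < m; col++) {
            int pivot = col;
            for (int r = col + 1; r < m; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) {
                    pivot = r;
                }
            }
            // Crude absolute threshold, chosen only for this sketch.
            if (Math.abs(a[pivot][col]) < 1e-12) {
                throw new IllegalStateException("X'X is singular or nearly so");
            }
            double[] tmpRow = a[col]; a[col] = a[pivot]; a[pivot] = tmpRow;
            double tmp = b[col]; b[col] = b[pivot]; b[pivot] = tmp;
            for (int r = col + 1; r < m; r++) {
                double f = a[r][col] / a[col][col];
                b[r] -= f * b[col];
                for (int c = col; c < m; c++) {
                    a[r][c] -= f * a[col][c];
                }
            }
        }
        double[] beta = new double[m];
        for (int i = m - 1; i >= 0; i--) { // back substitution
            double s = b[i];
            for (int c = i + 1; c < m; c++) {
                s -= a[i][c] * beta[c];
            }
            beta[i] = s / a[i][i];
        }
        return beta;
    }
}
```

A caller would loop over whatever data source is at hand (a file reader, a database cursor), call addObservation() per row, and call regress() whenever estimates are needed. Residuals are deliberately absent: computing them would need a second pass over the data, which is the reason a fixed-model API fits poorly here.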