Hello.

On Thu, 16 May 2019 at 10:02, Ben Nguyen <bennguye...@gmail.com> wrote:
>
> Hello,
>
> I have some broad general ideas about how the regression module should be
> structured, as outlined briefly in my proposal with UMLs.
>
> This is the current implementation inside commons-math-stat-regression:

It seems there is/was an image here but I don't see it.
For this kind of information, please use JIRA (and provide the link here).

> This is my proposed idea, where the structure was partly inspired by SuanShu
> since it supported multiple types of regression (including logistic):
>
> https://github.com/aaiyer/SuanShu/tree/master/src/main/java/com/numericalmethod/suanshu/stats/regression/linear
>
> Disclaimer: I have only studied some econometrics and second year computer
> science in university, so I have zero professional data engineering
> experience, but am excited to start learning with this project. So, I don’t
> currently know the exact needs of data engineers in regards to this module
> and am learning as I go… which is why I would very much appreciate any input
> on the kinds of requirements data engineers would want from this regression
> module.

Basing a design on use-cases is very useful.
You should collect a range of them (small/large datasets, in-memory/stream,
dense/sparse) in order to figure out which parts of the code can be common
and which require specialization.

> From someone who has used the current implementation or will use this new
> implementation:
>
> What would make your life easier?
> What should definitely be kept?
> What should be added/improved?
> Any specific features or design criteria?
> Any changes or radically different approaches to the following idea?

Good questions!
What are your answers? ;-)

> Note: OLS, GLS and Logistic regression are the first to be implemented, with
> focus to make architectural support for further additions. Changes will make
> use of new Java 8 features, specifically the Java Streams API to improve
> performance and readability.

+1
I'd suggest selecting one and starting to code, without fearing that you'll
probably have to change a lot of it as more use-cases are collected.
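
For instance, a first cut could be as small as the sketch below: simple
(single-regressor) OLS written with the Java 8 Streams API. The class name
`SimpleOlsSketch` and the method `fit` are hypothetical; they come neither
from the proposal nor from Commons Math. It is only meant as a throwaway
starting point under those assumptions, not as a design.

```java
import java.util.stream.IntStream;

/** Hypothetical sketch: simple OLS for the model y = b0 + b1 * x. */
final class SimpleOlsSketch {
    private SimpleOlsSketch() {}

    /**
     * Fits y = b0 + b1 * x by ordinary least squares.
     *
     * @param x regressor values
     * @param y response values (same length as x)
     * @return array {b0, b1} (intercept, slope)
     */
    static double[] fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("Need two arrays of equal length >= 2");
        }
        final int n = x.length;

        // Sample means of regressor and response.
        final double meanX = IntStream.range(0, n).mapToDouble(i -> x[i]).average().getAsDouble();
        final double meanY = IntStream.range(0, n).mapToDouble(i -> y[i]).average().getAsDouble();

        // Centered cross-product and sum of squares.
        final double sxy = IntStream.range(0, n)
            .mapToDouble(i -> (x[i] - meanX) * (y[i] - meanY))
            .sum();
        final double sxx = IntStream.range(0, n)
            .mapToDouble(i -> (x[i] - meanX) * (x[i] - meanX))
            .sum();
        if (sxx == 0) {
            throw new IllegalArgumentException("All x values are identical");
        }

        final double slope = sxy / sxx;
        final double intercept = meanY - slope * meanX;
        return new double[] {intercept, slope};
    }
}
```

A call such as `SimpleOlsSketch.fit(xs, ys)` returns the intercept and slope;
whether streams actually help performance here is exactly the kind of thing
the use-cases (and benchmarks) should decide.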

> Updates to this proposed implementation UML in my proposal:
>
> “statistics-regression-reqLinearMath” will be replaced with EJML as suggested
> by Mr. Eric Barnhill
>
> This will include a custom matrix class extended from EJML’s SimpleBase ->
> StatisticsMatrix
> So if we decide to use an Apache Commons implementation of matrices later on,
> only this class should be changed internally.

Good precaution; but I doubt that we can include everything in a single class.
How best to encapsulate the (external) linear algebra library is a subject of
its own, worth its own thread: cramming many questions into a single post
makes it likely that some will be missed by people who might later on
question the chosen path. [External dependencies are a sensitive issue in
Commons...]

Also, a reminder that we need to take into account the comparative benchmarks
which I posted recently. [Even if just to conclude that EJML has overwhelming
advantages (which?) that make it more suitable than its "competitors".]

> Abstract classes should have interfaces above them or perhaps just be
> interfaces if a simpler approach is implemented (i.e. minimal OOP)
>
> Notes about this proposed implementation:
>
> AbstractVariables and its child classes may not be necessary, i.e. just
> Estimators and Residuals classes
> Or perhaps it’s best to follow the current implementation’s example and have
> a single class per regression type for hierarchy simplicity (but risking
> redundancies)?
> I have not looked into specific data members or individual methods yet. So
> far just taking notes from the current implementation and SuanShu
> The “statistics-regression-updating” components have quite complex algorithms
> which will require a lot of time for me to understand completely
>
> So for now, I see myself making minimal changes to them, prioritizing the new
> “stored” components.

IMHO, this will be better discussed once an initial implementation is shown
(or perhaps, as Eric suggested, with unit tests).
Again, better to start a new thread for each specific question, possibly
backed with a new JIRA report focused on a particular task (see "Create
sub-tasks" on JIRA).

> RegressionDataLoader’s purpose is to:
>
> provide a clean input interface
> and to ensure that data from say double[][] is only converted to working
> form as a StatisticsMatrix object once

Until proven wrong, I'm a proponent of separating I/O from "useful"
computations. I.e. I suggest that we consider, on the one hand, what API is
required for all the intended functionalities and, on the other (in a
*different* "maven module"), all the conversions that may be implemented for
the convenience of users.

> while allowing multiple types of regression to be calculated via a universal
> form…
> which could become a challenge once details are in order.
>
> So this is the current state of my plan, with your input, I will move to the
> next steps, plan more details and start creating the software flowchart.
>
> Thank you in advance for any advice/suggestions,

To summarize, my main suggestion is to split this post into more manageable
chunks.

Regards,
Gilles

> -Ben Nguyen