Hi. Le mer. 24 juil. 2019 à 18:00, Ben Nguyen <bennguye...@gmail.com> a écrit : > > Hello, > Thank you for your feedback, I will answer your questions: > > > Why is the constructor "protected"? > > I had thought it should be restricted to only work with ols.regress() but > then it should’ve be default. > But public might be the way to go, that restriction is probably not > necessary….?
"protected" means there is a case for defining a subclass. If the class holds data that is only instantiated by our library, there may be a case for preventing "external" instantiation (i.e. make the constructor private or package-private). > > > Why is the computation performed in that constructor although the data is > > stored in "OLSRegression"? [This looks like what PMD complains about.] > > Sorry, I don’t fully understand this question…. > From what I understand after searching about the Data Class violation: > Quote: “[OLSResults ]interface reveals more data than behaviour. Low values > (less than 30%) [OLSResults showed 20%] indicate that the class reveals much > more data than behaviour, which is a sign of poor encapsulation.” > Quote under “Weight Of Class (WOC)” > https://pmd.github.io/latest/pmd_java_metrics_index.html The point is not to make the report clean but to come up with a satisfactory design. It would be helpful to start from use-cases. > > It's good to try and have encapsulation preserved through accessors but in > > that case it's broken because some return a reference to an array (entries > > are mutable). > > I will fix that with a deepCopy() method, though I’m not sure where to put > it, in same class? > Could also do one-liner like this for 2D array: > public double[][] getBetaVariance() { > return Arrays.stream(betaVariance).map(el -> el.clone()).toArray($ -> > hatMatrix.clone()); > } > But I think .clone() is okay for primitive type 1D arrays? If not, > System.arraycopy() should be okay? There is also "copyOf" in JDK class "Arrays". > > Implementing an interface "RegressionResults" suggest that application > > should program against it, and not the specific "OLSResult"; however the > > latter has more methods than the interface defines. > > Yeah, I think this is a conflict with: > > > RegressionResults res = reg.regress(data); > > Since not all regression types calculate the same statistics, ex: OLS has > R-squared, TSS, RSS > While GLS doesn’t officially have that (unless you count pseudo-solutions out > there) > But of course, all regression types have many similar statistics like beta, > beta variance, residuals, etc > So I’m not sure how to structure an interface like RegressionResults to > account for this, > I will look carefully into the factory method design pattern and understand > it properly to solve this, > I think I might’ve missed things that would address this….? I will now > definitely refactor the current PR 23 into > a proper factory method design as well with consideration for > inner-interfaces you mentioned. I'd think it could help with the PMD report, but it will not solve the issue of the "incomplete" interface. > > What is the most useful functionality? > > What is the most robust functionality? > > IOW: Is there a case for not computing everything? > > My consideration (though I don’t have professional experience in Data > science/engineering) is maybe someone needs > to calculate one statistic from many new samples (constantly coming in) at > once…. Streams? Isn't the current "stored" implementation at odds with that use-case? > This case might not exist, but it’s what I thought of, > and considering for cases I haven’t thought of as well. The old code allowed > for this so I didn’t want to take that away…. > > > There are many redundant computations... > > Yes, PR 23 right now is kind of v1 to show everything works. Kind of version 0.1-alpha, you mean? :-) > I was focused on getting the port to work 100% with EJML first with old code > base, EJML is another issue with the current version: "StatisticsMatrix" inherits from an EJML, so there is a direct (public API) dependence (that should be avoided). > I will look at optimization between methods (and factory methods) more deeply > now (another optimization found with decomposing QR twice as well). Good but a good design usually avoids redundancy. > > > The RegressionResults interface could also include some kind of output > > > method to get all statistics in a file or displayed? > > Why should it be done within the class if it provides accessors to all > > its contents? > > This issue asked for an ANOVA table printout: > https://issues.apache.org/jira/browse/MATH-1097 > So I was considering that (even though calculating F-stat and p-value would > be another challenge), > But this lead me to consider a summary statistics printout like one found in > R, for user convenience… > Are you saying this is not necessary for this application? No, I'm saying that we should first focus on the main functionality. Regards, Gilles > > Again, thank you for your feedback, > -Ben Nguyen > > > > From: Gilles Sadowski > Sent: Tuesday, July 23, 2019 6:45 PM > To: Commons Developers List > Subject: Re: [GSoC][Regression][OLS] RegressionResults/OLSResults Re-design > > Hello. > > [Please note that the first tag (in square bracket) on the "Subject" line > should > be the (short) name of the component, i.e. in this case, "[Statistics]".] > > Le mar. 23 juil. 2019 à 20:31, Ben Nguyen <bennguye...@gmail.com> a écrit : > > > > Hello, > > > > Issue was noticed as a PMD violation during PR’s Travis CI build. > > “The class 'OLSResults' is suspected to be a Data Class (WOC=20.000%, > > NOPA=0, NOAM=4, WMC=6)” > > https://github.com/apache/commons-statistics/pull/23 > > > > UML for current design: > > https://github.com/BBenNguyenn/commons-statistics/blob/gsoc-milestone-1/commons-statistics-regression/UML_current.png > > > > The current design is my (probably wrong) interpretation on a suggestion > > from Eric Barnhill. > > The idea is that all calculation functionality is done in the OLSRegression > > class, which passes itself to the OLSResults constructor via regress() > > method. > > Better directly show the relevant code: > > public class OLSRegression extends AbstractRegression { > // ... > > public RegressionResults regress() { > return new OLSResults(this); > } > } > > public class OLSResults implements RegressionResults, Serializable { > protected OLSResults(OLSRegression ols) { > this.beta = ols.estimateBeta(); > this.betaVariance = ols.estimateBetaVariance(); > this.betaStandardErrors = ols.estimateBetaStandardErrors(); > this.residuals = ols.estimateResiduals(); > this.regressionStandardErrors = ols.estimateRegressionStandardError(); > this.regressandVariance = ols.estimateRegressandVariance(); > > this.hatMatrix = ols.calculateHat().toArray2D(); > this.totalSumOfSquares = ols.calculateTotalSumOfSquares(); > this.residualSumOfSquares = ols.calculateResidualSumOfSquares(); > this.regRSquared = ols.calculateRSquared(); > this.adjustedRSquared = ols.calculateAdjustedRSquared(); > } > } > > Why is the constructor "protected"? > Why is the computation performed in that constructor although the data is > stored in "OLSRegression"? [This looks like what PMD complains about.] > > > The OLSResults (implementing RegressionResults interface) calls on > > statistics calculating methods from OLSRegression and saves it as private > > fields with getters. > > It's good to try and have encapsulation preserved through accessors but in > that case it's broken because some return a reference to an array (entries > are mutable). > > Implementing an interface "RegressionResults" suggest that application > should program against it, and not the specific "OLSResult"; however the > latter has more methods than the interface defines. > > > I am thinking this is good for users who want everything to be calculated > > only once to then be used multiple times (without the recalculation or > > having to save all statistics themselves). > > What is the most useful functionality? > What is the most robust functionality? > IOW: Is there a case for not computing everything? > > These are two of the methods called in the constructor above: > > public double calculateRSquared() { > return 1 - calculateResidualSumOfSquares() / > calculateTotalSumOfSquares(); > } > > public double calculateAdjustedRSquared() { > final double n = getX().numRows(); > if (getHasIntercept()) { > return 1 - (1 - calculateRSquared()) * (n / (n - > getX().numCols())); > } else { > return 1 - > (calculateResidualSumOfSquares() * (n - 1)) / > (calculateTotalSumOfSquares() * (n - > getX().numCols())); > } > } > > There are many redundant computations... > > > The RegressionResults interface could also include some kind of output > > method to get all statistics in a file or displayed? > > Why should it be done within the class if it provides accessors to all > its contents? > > > Then, for those who only want a single calculated statistic, the methods to > > calculate in OLSRegression are also public. > > Then even assuming that you fix somehow "OLSResults" to compute > each result only once, a user who directly calls the above two methods > will get subpar performance. > > > This design does cause that PMD error, so any suggestions or alternate > > designs to correct my interpretation? > > There are definite problems with your implementation that need fixing; > but it's not necessarily proof of a bad design. > However it's clearly not the most obvious one, which IMO would go along > the lines > > RegressionData data = RegressionDataLoader.from(x, y, hasIntercept); > Regression reg = new OLSRegression(); > RegressionResults res = reg.regress(data); > > This is all untested of course, and could well fall short for advanced > usage (stream?). Your design (or yet another) could be an attempt > to handle particular use-cases which the obvious approach does not > cover (?) ... > > Regards, > Gilles > > > > > Thank you, > > Cheers, > > -Ben Nguyen --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org For additional commands, e-mail: dev-h...@commons.apache.org