Re: [GSoC][Regression][OLS] RegressionResults/OLSResults Re-design

Gilles Sadowski Wed, 24 Jul 2019 15:17:25 -0700

Hi.

Le mer. 24 juil. 2019 à 18:00, Ben Nguyen <bennguye...@gmail.com> a écrit :
>
> Hello,
> Thank you for your feedback, I will answer your questions:
>
> > Why is the constructor "protected"?
>
> I had thought it should be restricted to only work with ols.regress() but 
> then it should’ve be default.
> But public might be the way to go, that restriction is probably not 
> necessary….?


"protected" means there is a case for defining a subclass.

If the class holds data that is only instantiated by our library,
there may be a case for preventing "external" instantiation (i.e.
make the constructor private or package-private).

>
> > Why is the computation performed in that constructor although the data is
> > stored in "OLSRegression"? [This looks like what PMD complains about.]
>
> Sorry, I don’t fully understand this question….
> From what I understand after searching about the Data Class violation:
> Quote: “[OLSResults ]interface reveals more data than behaviour. Low values 
> (less than 30%) [OLSResults showed 20%] indicate that the class reveals much 
> more data than behaviour, which is a sign of poor encapsulation.”
> Quote under “Weight Of Class (WOC)”
> https://pmd.github.io/latest/pmd_java_metrics_index.html

The point is not to make the report clean but to come up
with a satisfactory design.
It would be helpful to start from use-cases.

> > It's good to try and have encapsulation preserved through accessors but in
> > that case it's broken because some return a reference to an array (entries
> > are mutable).
>
> I will fix that with a deepCopy() method, though I’m not sure where to put 
> it, in same class?
> Could also do one-liner like this for 2D array:
>     public double[][] getBetaVariance() {
>         return Arrays.stream(betaVariance).map(el -> el.clone()).toArray($ -> 
> hatMatrix.clone());
>     }
> But I think .clone() is okay for primitive type 1D arrays? If not, 
> System.arraycopy() should be okay?

There is also "copyOf" in JDK class "Arrays".

> > Implementing an interface "RegressionResults" suggest that application
> > should program against it, and not the specific "OLSResult"; however the
> > latter has more methods than the interface defines.
>
> Yeah, I think this is a conflict with:
>
> > RegressionResults res = reg.regress(data);
>
> Since not all regression types calculate the same statistics, ex: OLS has 
> R-squared, TSS, RSS
> While GLS doesn’t officially have that (unless you count pseudo-solutions out 
> there)
> But of course, all regression types have many similar statistics like beta, 
> beta variance, residuals, etc
> So I’m not sure how to structure an interface like RegressionResults to 
> account for this,
> I will look carefully into the factory method design pattern and understand 
> it properly to solve this,
> I think I might’ve missed things that would address this….? I will now 
> definitely refactor the current PR 23 into
> a proper factory method design as well with consideration for 
> inner-interfaces you mentioned.

I'd think it could help with the PMD report, but it will not solve
the issue of the "incomplete" interface.

> > What is the most useful functionality?
> > What is the most robust functionality?
> > IOW:  Is there a case for not computing everything?
>
> My consideration (though I don’t have professional experience in Data 
> science/engineering) is maybe someone needs
> to calculate one statistic from many new samples (constantly coming in) at 
> once….

Streams?

Isn't the current "stored" implementation at odds with that use-case?

> This case might not exist, but it’s what I thought of,
> and considering for cases I haven’t thought of as well. The old code allowed 
> for this so I didn’t want to take that away….
>
> > There are many redundant computations...
>
> Yes, PR 23 right now is kind of v1 to show everything works.

Kind of version 0.1-alpha, you mean? :-)

> I was focused on getting the port to work 100% with EJML first with old code 
> base,

EJML is another issue with the current version: "StatisticsMatrix" inherits
from an EJML, so there is a direct (public API) dependence (that should
be avoided).

> I will look at optimization between methods (and factory methods) more deeply 
> now (another optimization found with decomposing QR twice as well).

Good but a good design usually avoids redundancy.

> > > The RegressionResults interface could also include some kind of output 
> > > method to get all statistics in a file or displayed?
> > Why should it be done within the class if it provides accessors to all
> > its contents?
>
> This issue asked for an ANOVA table printout: 
> https://issues.apache.org/jira/browse/MATH-1097
> So I was considering that (even though calculating F-stat and p-value would 
> be another challenge),
> But this lead me to consider a summary statistics printout like one found in 
> R, for user convenience…
> Are you saying this is not necessary for this application?

No, I'm saying that we should first focus on the main functionality.

Regards,
Gilles

>
> Again, thank you for your feedback,
> -Ben Nguyen
>
>
>
> From: Gilles Sadowski
> Sent: Tuesday, July 23, 2019 6:45 PM
> To: Commons Developers List
> Subject: Re: [GSoC][Regression][OLS] RegressionResults/OLSResults Re-design
>
> Hello.
>
> [Please note that the first tag (in square bracket) on the "Subject" line 
> should
> be the (short) name of the component, i.e. in this case, "[Statistics]".]
>
> Le mar. 23 juil. 2019 à 20:31, Ben Nguyen <bennguye...@gmail.com> a écrit :
> >
> > Hello,
> >
> > Issue was noticed as a PMD violation during PR’s Travis CI build.
> > “The class 'OLSResults' is suspected to be a Data Class (WOC=20.000%, 
> > NOPA=0, NOAM=4, WMC=6)”
> > https://github.com/apache/commons-statistics/pull/23
> >
> > UML for current design:
> > https://github.com/BBenNguyenn/commons-statistics/blob/gsoc-milestone-1/commons-statistics-regression/UML_current.png
> >
> > The current design is my (probably wrong) interpretation on a suggestion 
> > from Eric Barnhill.
> > The idea is that all calculation functionality is done in the OLSRegression 
> > class, which passes itself to the OLSResults constructor via regress() 
> > method.
>
> Better directly show the relevant code:
>
> public class OLSRegression extends AbstractRegression {
>     // ...
>
>     public RegressionResults regress() {
>         return new OLSResults(this);
>     }
> }
>
> public class OLSResults implements RegressionResults, Serializable {
>     protected OLSResults(OLSRegression ols) {
>         this.beta = ols.estimateBeta();
>         this.betaVariance = ols.estimateBetaVariance();
>         this.betaStandardErrors = ols.estimateBetaStandardErrors();
>         this.residuals = ols.estimateResiduals();
>         this.regressionStandardErrors = ols.estimateRegressionStandardError();
>         this.regressandVariance = ols.estimateRegressandVariance();
>
>         this.hatMatrix = ols.calculateHat().toArray2D();
>         this.totalSumOfSquares = ols.calculateTotalSumOfSquares();
>         this.residualSumOfSquares = ols.calculateResidualSumOfSquares();
>         this.regRSquared = ols.calculateRSquared();
>         this.adjustedRSquared = ols.calculateAdjustedRSquared();
>     }
> }
>
> Why is the constructor "protected"?
> Why is the computation performed in that constructor although the data is
> stored in "OLSRegression"? [This looks like what PMD complains about.]
>
> > The OLSResults (implementing RegressionResults interface) calls on 
> > statistics calculating methods from OLSRegression and saves it as private 
> > fields with getters.
>
> It's good to try and have encapsulation preserved through accessors but in
> that case it's broken because some return a reference to an array (entries
> are mutable).
>
> Implementing an interface "RegressionResults" suggest that application
> should program against it, and not the specific "OLSResult"; however the
> latter has more methods than the interface defines.
>
> > I am thinking this is good for users who want everything to be calculated 
> > only once to then be used multiple times (without the recalculation or 
> > having to save all statistics themselves).
>
> What is the most useful functionality?
> What is the most robust functionality?
> IOW:  Is there a case for not computing everything?
>
> These are two of the methods called in the constructor above:
>
>     public double calculateRSquared() {
>         return 1 - calculateResidualSumOfSquares() /
> calculateTotalSumOfSquares();
>     }
>
>     public double calculateAdjustedRSquared() {
>         final double n = getX().numRows();
>         if (getHasIntercept()) {
>             return 1 - (1 - calculateRSquared()) * (n / (n - 
> getX().numCols()));
>         } else {
>             return 1 -
>                    (calculateResidualSumOfSquares() * (n - 1)) /
>                        (calculateTotalSumOfSquares() * (n - 
> getX().numCols()));
>         }
>     }
>
> There are many redundant computations...
>
> > The RegressionResults interface could also include some kind of output 
> > method to get all statistics in a file or displayed?
>
> Why should it be done within the class if it provides accessors to all
> its contents?
>
> > Then, for those who only want a single calculated statistic, the methods to 
> > calculate in OLSRegression are also public.
>
> Then even assuming that you fix somehow "OLSResults" to compute
> each result only once, a user who directly calls the above two methods
> will get subpar performance.
>
> > This design does cause that PMD error, so any suggestions or alternate 
> > designs to correct my interpretation?
>
> There are definite problems with your implementation that need fixing;
> but it's not necessarily proof of a bad design.
> However it's clearly not the most obvious one, which IMO would go along
> the lines
>
> RegressionData data = RegressionDataLoader.from(x, y, hasIntercept);
> Regression reg = new OLSRegression();
> RegressionResults res = reg.regress(data);
>
> This is all untested of course, and could well fall short for advanced
> usage (stream?).  Your design (or yet another) could be an attempt
> to handle particular use-cases which the obvious approach does not
> cover (?) ...
>
> Regards,
> Gilles
>
> >
> > Thank you,
> > Cheers,
> > -Ben Nguyen

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

Re: [GSoC][Regression][OLS] RegressionResults/OLSResults Re-design

Reply via email to