[ 
https://issues.apache.org/jira/browse/MATH-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17940737#comment-17940737
 ] 

Alex Herbert commented on MATH-1671:
------------------------------------

I have removed a large amount of the descriptive package in [PR 
270|https://github.com/apache/commons-math/pull/260]. Many of the remaining 
computations are now delegated to Commons Statistics (CS).

Changes are summarised in the commit history:
{noformat}
Removed percentile classes from descriptive.rank:
    
    CentralPivotingStrategy
    KthSelector
    Median
    MedianOf3PivotingStrategy
    Percentile
    PivotingStrategy
    RandomPivotingStrategy

Refactor the SummaryStatistics default implementations to use
DoubleStatistics.
    
    Removes method:
    
    getPopulationVariance
    getSecondMoment
    
    The population variance relies the second moment implementation which
    computes a statistic related to the central second moment. It is not a
    standard statistic and is not supported in Commons Statistics.
    
    Updates the min/max implementations to use Math.min/max. Previous
    behaviour ignored NaN values. The change now matches with JDK stream
    behaviour.

Refactor the DescriptiveStatistics default implementations to use
DoubleStatistics.
    
    Removes method:
    
    getPopulationVariance
    
    The variance implementation can be overridden if desired.
    
    Updates the min/max implementations to use Math.min/max. Previous
    behaviour ignored NaN values. The change now matches with JDK stream
    behaviour.

Removes redundant classes.
    
    descriptive.moment:
    
    - FirstMoment
    - FourthMoment
    - GeometricMean
    - Kurtosis
    - SecondMoment
    - Skewness
    - StandardDeviation
    - ThirdMoment
    
    Mean + Variance have been changed to only implement the weighted
    evaluation interface.
    
    descriptive.rank:
    - Min
    - Max
    
    descriptive.summary:
    - Sum
    - SumOfLogs
    - SumOfSquares
    
    Product has been changed to only implement the weighted evaluation
    interface.
    
    The utility class StatUtils has been updated to delegate all calls to
    Commons Statistics. Legacy Math exceptions have been preserved. Removes
    methods to compute the variance using an existing mean:
    
    public static double variance(double[] values, double mean, int begin,
    int length)
    public static double variance(double[] values, double mean)
    public static double populationVariance(double[] values, double mean,
    int begin, int length)
    public static double populationVariance(double[] values, double mean)
    
    Note: StatUtils has inconsistent documentation of what to return for an
    empty array. The documentation states NaN but StatUtilsTest requires
    otherwise:
    
    Sum-of-squares = 0
    Product = 1
    Sum-of-logs = 0
    
    This is inconsistent and has been updated to NaN for all statistics.
    
    The class MultivariateSummaryStatistics has been updated with partial
    implementations of StorelessUnivariateStatistic that delegate to Commons
    Statistics.
    
    Some test classes have been updated to pass the build after removal of
    the statistic implementations.
{noformat}
Here is what is left:
{noformat}
descriptive.moment
- Mean
- SemiVariance
- Variance
- VectorialCovariance
- VectorialMean

descriptive.rank
- PSquarePercentile

descriptive.summary:
- Product

descriptive:
- AbstractStorelessUnivariateStatistic
- AbstractUnivariateStatistic
- AggregateSummaryStatistics
- DescriptiveStatistics
- MultivariateSummaryStatistics
- StatisticalMultivariateSummary
- StatisticalSummary
- StatisticalSummaryValues
- StorelessUnivariateStatistic
- SummaryStatistics
- SynchronizedDescriptiveStatistics
- SynchronizedMultivariateSummaryStatistics
- SynchronizedSummaryStatistics
- UnivariateStatistic
- WeightedEvaluation
{noformat}
The Mean, Variance and Product are only present to implement the 
WeightedEvaluation interface. These are stateless classes and could be made 
into a singleton. Weighted evaluation is not supported in Commons Statistics 
(CS).

The SemiVariance is not supported in CS. The [semi-variance 
(wikipedia)|https://en.wikipedia.org/wiki/Variance#Semivariance] is computed 
using all deviations on one side of a threshold (e.g. the mean). CS 
implementations support streaming values and can be combined, e.g. adding two 
variances together. It is not valid to add two semi variances with a different 
cut-off, and if the cut-off is dynamic (e.g. the mean) then it is not possible 
to compute on streamed values as each update to the mean would require 
reassessing all previous values that are on one side of the new cut-off.

The PSquarePercentile is a method to compute a percentile estimate on a stream 
of values without storing them all. It works but is very slow. The 
PSquarePercentileTest.testDistribution test takes ~8 seconds. When refactoring 
I noticed that this test has an equals delta of Double.MAX_VALUE for many of 
the tests. I added a note to revisit this as the tests are not really checking 
anything. It has a computation method that is not compatible with the CS API. 
Moving it to CS would leave it an odd one out in that codebase rather than CM. 
I suggest leaving it in CM.

The VectorialCovariance and VectorialMean implement statistics on double[] 
rather than double. Their is no equivalent in CS.

The remaining classes are supporting summary statistics. These are now built on 
CS functionality. However they add features not in CS such as synchronization, 
aggregating statistics in sets, and storing all values to compute statistics 
using a (rolling) array. All of these are useful features building on CS.

The summary statistics also support changing the underlying implementation. 
This was of potential use when there were public implementations in the library 
(e.g. changing the variance to compute the population variance). I have 
previously used this feature to effectively disable computation of statistics 
using a no-op implementation for performance reasons (see below for performance 
of various statistics). Using it now to change implementation would require the 
user to write an implementation. So this feature is not so versatile as before.
h2. Further Changes

One item for discussion is the summary stats to support. The main summary 
interface StatisticalSummary supports min, max, count, sum, mean, variance and 
standard deviation. However the SummaryStatistics implementation for streamed 
values also supports the sum-of-squares, sum-of-logs and the derived geometric 
mean. This is an issue as I would argue that these statistics are not very 
widely used. However they have to be computed and will take a significant 
portion of computation time.

>From a recent benchmark in Commons Statistics (see STATISTICS-89), here is 
>table of the time taken to compute the statistics on an array of length 2^15:
|Statistic|Time|Cumulative Total|
|MIN|17777.1161| |
|MAX|17783.4962| |
|SUM_OF_SQUARES|26653.4201| |
|SUM|29301.2397| |
|MEAN|29550.9464| |
|VARIANCE|56871.0277|177937.246|
|SUM_OF_LOGS|332545.433| |

Computing the sum-of-logs takes twice as long as all the other faster 
statistics combined. It is useless if any values are negative or zero. I would 
propose removing this from the summary statistics. Note that the JDK's 
[DoubleSummaryStatistics|https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/DoubleSummaryStatistics.html]
 only supports min, max, count, sum and average (mean). So the addition of 
variance by the Commons Math summary classes is very useful. The sum of squares 
is fast but also could be removed as it is not an essential statistic. It can 
still be computed on arrays using StatUtils and also by the 
DescriptiveStatistics which stores all values.

 

I have suggested a change to make the implementations of WeightedEvaluation 
into singletons (Mean, Variance, Product). I also note that Sum did support 
this interface's methods (as it was used by the Mean) but it did not implement 
the interface. This was probably an omission by the original author. At current 
I have removed Sum from the package but it could be reinstated and made to 
actually implement the interface. If leaving these classes only to support the 
weighted evaluation then I suggest changing their names to prefix with 
Weighted, e.g. WeightedMean.

 

> Reuse "Commons Statistics" (descriptive)
> ----------------------------------------
>
>                 Key: MATH-1671
>                 URL: https://issues.apache.org/jira/browse/MATH-1671
>             Project: Commons Math
>          Issue Type: Sub-task
>          Components: legacy
>    Affects Versions: 4.0-beta1
>            Reporter: Gilles Sadowski
>            Priority: Major
>              Labels: pull-request-available, refactor
>             Fix For: 4.0-beta2
>
>
> Reuse functionality in module 
> [{{descriptive}}|https://commons.apache.org/proper/commons-statistics/commons-statistics-docs/apidocs/org/apache/commons/statistics/descriptive/package-summary.html].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to