[ https://issues.apache.org/jira/browse/MATH-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17940737#comment-17940737 ]
Alex Herbert commented on MATH-1671: ------------------------------------ I have removed a large amount of the descriptive package in [PR 270|https://github.com/apache/commons-math/pull/260]. Many of the remaining computations are now delegated to Commons Statistics (CS). Changes are summarised in the commit history: {noformat} Removed percentile classes from descriptive.rank: CentralPivotingStrategy KthSelector Median MedianOf3PivotingStrategy Percentile PivotingStrategy RandomPivotingStrategy Refactor the SummaryStatistics default implementations to use DoubleStatistics. Removes method: getPopulationVariance getSecondMoment The population variance relies the second moment implementation which computes a statistic related to the central second moment. It is not a standard statistic and is not supported in Commons Statistics. Updates the min/max implementations to use Math.min/max. Previous behaviour ignored NaN values. The change now matches with JDK stream behaviour. Refactor the DescriptiveStatistics default implementations to use DoubleStatistics. Removes method: getPopulationVariance The variance implementation can be overridden if desired. Updates the min/max implementations to use Math.min/max. Previous behaviour ignored NaN values. The change now matches with JDK stream behaviour. Removes redundant classes. descriptive.moment: - FirstMoment - FourthMoment - GeometricMean - Kurtosis - SecondMoment - Skewness - StandardDeviation - ThirdMoment Mean + Variance have been changed to only implement the weighted evaluation interface. descriptive.rank: - Min - Max descriptive.summary: - Sum - SumOfLogs - SumOfSquares Product has been changed to only implement the weighted evaluation interface. The utility class StatUtils has been updated to delegate all calls to Commons Statistics. Legacy Math exceptions have been preserved. Removes methods to compute the variance using an existing mean: public static double variance(double[] values, double mean, int begin, int length) public static double variance(double[] values, double mean) public static double populationVariance(double[] values, double mean, int begin, int length) public static double populationVariance(double[] values, double mean) Note: StatUtils has inconsistent documentation of what to return for an empty array. The documentation states NaN but StatUtilsTest requires otherwise: Sum-of-squares = 0 Product = 1 Sum-of-logs = 0 This is inconsistent and has been updated to NaN for all statistics. The class MultivariateSummaryStatistics has been updated with partial implementations of StorelessUnivariateStatistic that delegate to Commons Statistics. Some test classes have been updated to pass the build after removal of the statistic implementations. {noformat} Here is what is left: {noformat} descriptive.moment - Mean - SemiVariance - Variance - VectorialCovariance - VectorialMean descriptive.rank - PSquarePercentile descriptive.summary: - Product descriptive: - AbstractStorelessUnivariateStatistic - AbstractUnivariateStatistic - AggregateSummaryStatistics - DescriptiveStatistics - MultivariateSummaryStatistics - StatisticalMultivariateSummary - StatisticalSummary - StatisticalSummaryValues - StorelessUnivariateStatistic - SummaryStatistics - SynchronizedDescriptiveStatistics - SynchronizedMultivariateSummaryStatistics - SynchronizedSummaryStatistics - UnivariateStatistic - WeightedEvaluation {noformat} The Mean, Variance and Product are only present to implement the WeightedEvaluation interface. These are stateless classes and could be made into a singleton. Weighted evaluation is not supported in Commons Statistics (CS). The SemiVariance is not supported in CS. The [semi-variance (wikipedia)|https://en.wikipedia.org/wiki/Variance#Semivariance] is computed using all deviations on one side of a threshold (e.g. the mean). CS implementations support streaming values and can be combined, e.g. adding two variances together. It is not valid to add two semi variances with a different cut-off, and if the cut-off is dynamic (e.g. the mean) then it is not possible to compute on streamed values as each update to the mean would require reassessing all previous values that are on one side of the new cut-off. The PSquarePercentile is a method to compute a percentile estimate on a stream of values without storing them all. It works but is very slow. The PSquarePercentileTest.testDistribution test takes ~8 seconds. When refactoring I noticed that this test has an equals delta of Double.MAX_VALUE for many of the tests. I added a note to revisit this as the tests are not really checking anything. It has a computation method that is not compatible with the CS API. Moving it to CS would leave it an odd one out in that codebase rather than CM. I suggest leaving it in CM. The VectorialCovariance and VectorialMean implement statistics on double[] rather than double. Their is no equivalent in CS. The remaining classes are supporting summary statistics. These are now built on CS functionality. However they add features not in CS such as synchronization, aggregating statistics in sets, and storing all values to compute statistics using a (rolling) array. All of these are useful features building on CS. The summary statistics also support changing the underlying implementation. This was of potential use when there were public implementations in the library (e.g. changing the variance to compute the population variance). I have previously used this feature to effectively disable computation of statistics using a no-op implementation for performance reasons (see below for performance of various statistics). Using it now to change implementation would require the user to write an implementation. So this feature is not so versatile as before. h2. Further Changes One item for discussion is the summary stats to support. The main summary interface StatisticalSummary supports min, max, count, sum, mean, variance and standard deviation. However the SummaryStatistics implementation for streamed values also supports the sum-of-squares, sum-of-logs and the derived geometric mean. This is an issue as I would argue that these statistics are not very widely used. However they have to be computed and will take a significant portion of computation time. >From a recent benchmark in Commons Statistics (see STATISTICS-89), here is >table of the time taken to compute the statistics on an array of length 2^15: |Statistic|Time|Cumulative Total| |MIN|17777.1161| | |MAX|17783.4962| | |SUM_OF_SQUARES|26653.4201| | |SUM|29301.2397| | |MEAN|29550.9464| | |VARIANCE|56871.0277|177937.246| |SUM_OF_LOGS|332545.433| | Computing the sum-of-logs takes twice as long as all the other faster statistics combined. It is useless if any values are negative or zero. I would propose removing this from the summary statistics. Note that the JDK's [DoubleSummaryStatistics|https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/DoubleSummaryStatistics.html] only supports min, max, count, sum and average (mean). So the addition of variance by the Commons Math summary classes is very useful. The sum of squares is fast but also could be removed as it is not an essential statistic. It can still be computed on arrays using StatUtils and also by the DescriptiveStatistics which stores all values. I have suggested a change to make the implementations of WeightedEvaluation into singletons (Mean, Variance, Product). I also note that Sum did support this interface's methods (as it was used by the Mean) but it did not implement the interface. This was probably an omission by the original author. At current I have removed Sum from the package but it could be reinstated and made to actually implement the interface. If leaving these classes only to support the weighted evaluation then I suggest changing their names to prefix with Weighted, e.g. WeightedMean. > Reuse "Commons Statistics" (descriptive) > ---------------------------------------- > > Key: MATH-1671 > URL: https://issues.apache.org/jira/browse/MATH-1671 > Project: Commons Math > Issue Type: Sub-task > Components: legacy > Affects Versions: 4.0-beta1 > Reporter: Gilles Sadowski > Priority: Major > Labels: pull-request-available, refactor > Fix For: 4.0-beta2 > > > Reuse functionality in module > [{{descriptive}}|https://commons.apache.org/proper/commons-statistics/commons-statistics-docs/apidocs/org/apache/commons/statistics/descriptive/package-summary.html]. -- This message was sent by Atlassian Jira (v8.20.10#820010)