[ https://issues.apache.org/jira/browse/STATISTICS-37?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17432981#comment-17432981 ]
Alex Herbert commented on STATISTICS-37:
----------------------------------------

This has been fixed by NUMBERS-171. The tests have been updated to use higher tolerances. The normal distribution tolerance was raised to approximately 3.5e-15. This is now one of the most accurate distributions compared to the reference data.

Note that the NormalDistribution computed the value to pass to the error function as:
{noformat}
xx = (x - mu) / (sd * sqrt(2))
{noformat}
This differs from the other implementations, which appear to compute (verified by reverse engineering their results):
{noformat}
xx = (x - mu) / sd / sqrt(2)
{noformat}
The present implementation has only 1 divide. The factor sqrt(2 * sd * sd) can be computed in the constructor (as sketched below). To maximise precision I have implemented this using extended precision floating point operations. The method is exact when tested on 500 random values in [1, 2). For reference, the following errors are observed on the same data for standard precision computations:

||Method||Max Error||RMS Error||
|sd * sqrt(2)|1.0598|0.4781|
|sqrt(2 * sd * sd)|0.7780|0.2144|

These have been measured using BigDecimal from an extended precision result to 25 digits. The sd * sqrt(2) method was the original method and has the largest errors. Replacing it with a method that computes the exact factor eliminates a source of error. This is warranted as the new error function in Numbers has a maximum error of 2 ULP.

I updated the TruncatedNormalDistribution to exploit the high precision computation for a range between x0 and x1. This is based on the ErfDifference class in Numbers (a sketch follows below). It is more accurate than the previous method using cdf(x1) - cdf(x0). This distribution now computes very accurate probabilities.

The distribution has a numerically unstable computation for the mean and variance. When using the updated logic for the range, this resulted in a variance of zero for one test case due to cancellation. I have left the computation of the moments alone. These moments are not used in the rest of the distribution; the inverse CDF uses the inverse error function, not a default search bounded by a range defined by the mean and variance. A ticket should be raised to track improvements to the moments computation.

The Levy distribution tolerance was raised from 1e-9 to 1e-14. Some test data required regeneration.
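As a minimal sketch of the single-divide formulation and the use of the inverse complementary error function described above, the following illustrates the idea against the Commons Numbers gamma package (Erfc, InverseErfc). The class, field and method names are illustrative only, and the denominator here is computed in standard precision rather than with the extended precision operations used in the actual fix:
{noformat}
import org.apache.commons.numbers.gamma.Erfc;
import org.apache.commons.numbers.gamma.InverseErfc;

/**
 * Sketch of a normal CDF / inverse CDF using a single divide by a
 * precomputed denominator (names are illustrative, not the actual
 * Commons Statistics implementation).
 */
class NormalSketch {
    private final double mu;
    /** Denominator sd * sqrt(2), computed once in the constructor. */
    private final double denom;

    NormalSketch(double mu, double sd) {
        this.mu = mu;
        // Standard precision here; the actual fix computes sqrt(2 * sd * sd)
        // using extended precision operations to make the factor exact.
        this.denom = Math.sqrt(2 * sd * sd);
    }

    /** CDF: P(X <= x) = 0.5 * erfc((mu - x) / denom), using a single divide. */
    double cumulativeProbability(double x) {
        return 0.5 * Erfc.value((mu - x) / denom);
    }

    /** Inverse CDF via the inverse complementary error function. */
    double inverseCumulativeProbability(double p) {
        return mu - denom * InverseErfc.value(2 * p);
    }
}
{noformat}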
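The harness behind the error table above is not shown in the comment; the following is an assumed sketch of how such a ULP comparison against a 25 digit BigDecimal reference could be performed (the FactorErrorSketch class and its random data are hypothetical, only the two factor expressions come from the comment; BigDecimal.sqrt requires Java 9+):
{noformat}
import java.math.BigDecimal;
import java.math.MathContext;
import java.util.SplittableRandom;

/** Sketch of measuring ULP error against a 25 digit BigDecimal reference. */
class FactorErrorSketch {
    private static final MathContext MC = new MathContext(25);

    /** Absolute error of the computed value in units of ULP of the reference. */
    static double ulpError(double computed, BigDecimal reference) {
        return new BigDecimal(computed).subtract(reference)
            .divide(new BigDecimal(Math.ulp(reference.doubleValue())), MC)
            .abs().doubleValue();
    }

    public static void main(String[] args) {
        SplittableRandom rng = new SplittableRandom(42);
        BigDecimal sqrt2 = new BigDecimal(2).sqrt(MC);
        double max1 = 0;
        double max2 = 0;
        for (int i = 0; i < 500; i++) {
            // Random standard deviation in [1, 2) as used in the comment above.
            double sd = 1 + rng.nextDouble();
            // Reference value of sd * sqrt(2) to 25 digits.
            BigDecimal exact = new BigDecimal(sd).multiply(sqrt2, MC);
            max1 = Math.max(max1, ulpError(sd * Math.sqrt(2), exact));
            max2 = Math.max(max2, ulpError(Math.sqrt(2 * sd * sd), exact));
        }
        System.out.printf("sd * sqrt(2)      max ULP error = %.4f%n", max1);
        System.out.printf("sqrt(2 * sd * sd) max ULP error = %.4f%n", max2);
    }
}
{noformat}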
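The following is a minimal sketch of the ErfDifference based range computation mentioned above. The class, its fields and the normalisation handling are illustrative assumptions; only ErfDifference.value is the Commons Numbers API referenced in the comment, and the real TruncatedNormalDistribution additionally clips the requested range to the truncation bounds and handles edge cases:
{noformat}
import org.apache.commons.numbers.gamma.ErfDifference;

/**
 * Sketch of computing P(x0 < X <= x1) for the parent normal as a single
 * erf difference rather than cdf(x1) - cdf(x0).
 */
class TruncatedNormalSketch {
    private final double mu;
    /** Denominator sd * sqrt(2). */
    private final double denom;
    /** Parent normal probability of the truncation interval [lower, upper]. */
    private final double normalisation;

    TruncatedNormalSketch(double mu, double sd, double lower, double upper) {
        this.mu = mu;
        this.denom = Math.sqrt(2 * sd * sd);
        this.normalisation = parentProbability(lower, upper);
    }

    /** Parent probability: 0.5 * (erf(z1) - erf(z0)) with zi = (xi - mu) / denom. */
    private double parentProbability(double x0, double x1) {
        // ErfDifference.value(a, b) = erf(b) - erf(a), computed to retain
        // precision when the two arguments are close.
        return 0.5 * ErfDifference.value((x0 - mu) / denom, (x1 - mu) / denom);
    }

    /** Probability of (x0, x1] under the truncated distribution. */
    double probability(double x0, double x1) {
        return parentProbability(x0, x1) / normalisation;
    }
}
{noformat}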
> Better inverse Erfc function for the Normal distribution
> --------------------------------------------------------
>
>                 Key: STATISTICS-37
>                 URL: https://issues.apache.org/jira/browse/STATISTICS-37
>             Project: Apache Commons Statistics
>          Issue Type: Improvement
>          Components: distribution
>    Affects Versions: 1.0
>            Reporter: Alex Herbert
>            Priority: Major
>
> The Normal distribution uses InverseErfc for the inverse CDF. This is a wrapper for InverseErf and as such it is limited in precision as p -> 0 since the method computes using (2p - 1). When p is < 1e-16 the inverse CDF returns -infinity. The smallest z is around -8.2, as computed by matlab:
> norminv(1e-16)
> -8.222082216130437
> scipy, R, matlab, octave, Mathematica all have methods to invert small p. Here is the output from matlab for norminv(5e-324): -38.472346342768788. It is accurate down to the minimum value for a 64-bit float.
> There are two free libraries that seem to be used for this: Boost and Cephes (function ndtri). The Boost licence is compatible with Apache and there are some Boost derived works already in Commons Numbers. Incorporating the Boost Erf functions into Numbers would be useful.
> The inability to invert the Erfc for all p affects the Normal, Truncated normal and Levy distributions.
> Boost error function inverses:
> [https://www.boost.org/doc/libs/1_77_0/libs/math/doc/html/math_toolkit/sf_erf/error_inv.html]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)