[issue37905] Remove NormalDist.overlap() or improve documentation?

Christoph Deil Wed, 21 Aug 2019 05:39:18 -0700


New submission from Christoph Deil <deil.christ...@googlemail.com>:


I saw that Python 3.8 will add a NormalDist class:
https://docs.python.org/3.8/library/statistics.html#normaldist-objects

Personally, I don't see the value of adding this to the Python standard 
library. The natural progression would be to keep extending it, only to end up 
duplicating what already exists in scientific Python packages.
But OK, I guess this is no longer up for debate?

I'd like to make a specific comment on NormalDist.overlap.
The rest of NormalDist is very standard, but that method is an oddball.
My suggestion is to remove it or to improve the documentation.

Current docstring: 
https://github.com/python/cpython/blob/44f2c096804e8e3adc09400a59ef9c9ae843f339/Lib/statistics.py#L959-L991

And this docs example:
https://github.com/python/cpython/commit/318d537daabf2bd5f781255c7e25bfce260cf227#diff-d436928bc44b5d7c40a8047840f55d35R620-R629


> What percentage of men and women will have the same height in `two normally
> distributed populations with known means and standard deviations
> <http://www.usablestats.com/lessons/normal>`_?

50.3%

This statement doesn't make sense to me. No two people have exactly the same 
height, so I think the answer to this question should be 0%.

Using

n = 100_000; sum(m > w for m, w in zip(men.samples(n), women.samples(n))) / n

I see that in 82% of random (man, woman) pairs the man is taller. 
That's a different measure, but stating that 50% of men and women have the 
same height is still confusing.
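
To make the two numbers above concrete, here is a minimal sketch that computes 
both. The parameters NormalDist(70, 4) and NormalDist(65, 3.5) are an 
assumption on my part (they are not stated in this message; they are what I 
believe the linked docs example used), and the seeds are arbitrary.

```python
from math import sqrt
from statistics import NormalDist

# ASSUMPTION: parameters not given in this message; believed to match the
# linked docs example (heights in inches).
men = NormalDist(70, 4)
women = NormalDist(65, 3.5)

# Overlapping coefficient: area under min(pdf_men, pdf_women).
ovl = men.overlap(women)

# Monte Carlo estimate of P(man taller than woman) for independent draws.
# Different seeds keep the two sample streams independent.
n = 100_000
pairs = zip(men.samples(n, seed=1), women.samples(n, seed=2))
sampled = sum(m > w for m, w in pairs) / n

# Closed form: for independent normals, m - w is normal with mean
# mu_m - mu_w and variance var_m + var_w, so P(m > w) = 1 - cdf(0).
diff = NormalDist(men.mean - women.mean, sqrt(men.variance + women.variance))
exact = 1 - diff.cdf(0)

print(f"overlap = {ovl:.3f}")
print(f"P(man taller): sampled = {sampled:.3f}, exact = {exact:.3f}")
```

Under those assumed parameters the overlap comes out near the 50.3% quoted in 
the docs and the pairwise probability near the 82% quoted above, which shows 
that the two quantities answer different questions.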

Note that there are many PDF overlap measures other than this area under 
min(pdf1, pdf2), and I think several of them are much more common in 
statistics and the physical sciences:
- https://en.wikipedia.org/wiki/Hellinger_distance
- https://arxiv.org/pdf/1407.7172.pdf
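
As an illustration of how such measures differ, the sketch below compares the 
Hellinger distance between two univariate normals (using the standard closed 
form via the Bhattacharyya coefficient) with the area under min(pdf1, pdf2), 
integrated numerically. The helper names hellinger and overlap_numeric are 
hypothetical, and the distribution parameters are assumed for illustration 
only.

```python
from math import exp, sqrt
from statistics import NormalDist


def hellinger(a: NormalDist, b: NormalDist) -> float:
    """Hellinger distance between two univariate normals (closed form)."""
    var_sum = a.variance + b.variance
    # Bhattacharyya coefficient: integral of sqrt(pdf_a * pdf_b).
    bc = sqrt(2 * a.stdev * b.stdev / var_sum) * exp(
        -((a.mean - b.mean) ** 2) / (4 * var_sum)
    )
    # Guard against tiny negative values from rounding.
    return sqrt(max(0.0, 1.0 - bc))


def overlap_numeric(a: NormalDist, b: NormalDist, steps: int = 20_000) -> float:
    """Midpoint-rule integral of min(pdf_a, pdf_b) over +/- 8 sigma."""
    lo = min(a.mean - 8 * a.stdev, b.mean - 8 * b.stdev)
    hi = max(a.mean + 8 * a.stdev, b.mean + 8 * b.stdev)
    h = (hi - lo) / steps
    xs = (lo + (i + 0.5) * h for i in range(steps))
    return h * sum(min(a.pdf(x), b.pdf(x)) for x in xs)


# ASSUMPTION: illustrative parameters, not taken from this message.
p = NormalDist(70, 4)
q = NormalDist(65, 3.5)

print(f"Hellinger distance      = {hellinger(p, q):.4f}")
print(f"1 - Hellinger^2 (BC)    = {1 - hellinger(p, q) ** 2:.4f}")
print(f"min(pdf1, pdf2) area    = {overlap_numeric(p, q):.4f}")
print(f"NormalDist.overlap()    = {p.overlap(q):.4f}")
```

Both measures equal their "identical distributions" value when p and q 
coincide (distance 0, overlap 1), but in between they give different numbers 
for the same pair of distributions, so "overlap" is not a single well-defined 
quantity.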

And note that the references currently given are odd (IMO basic statistics 
textbooks, or open references like Wikipedia, would be more appropriate):
- slides: 
http://www.iceaaonline.com/ready/wp-content/uploads/2014/06/MM-9-Presentation-Meet-the-Overlapping-Coefficient-A-Measure-for-Elevator-Speeches.pdf
- implementation code comment points to 
http://dx.doi.org/10.1080/03610928908830127 which is behind a paywall

Why add this one overlap measure and expose it under the "overlap" method name?

My suggestion would be to be conservative and remove that method again 
before releasing it in 3.8. The docs could instead point to existing 
third-party packages (e.g. scipy or the uncertainties package) that offer 
further functionality, such as handling correlations or multi-dimensional 
distributions. I'd be happy to send a PR for this change any time.

Raymond and others interested in this topic - thoughts?

(note: I wrote a MultiNorm class prototype last year at 
https://github.com/cdeil/multinorm/blob/master/multinorm.py. I now want to 
rewrite it and find a good API, which is why I was interested in this 
NormalDist class and the functionality it offers)

----------
components: Library (Lib)
messages: 350076
nosy: Christoph.Deil, rhettinger
priority: normal
severity: normal
status: open
title: Remove NormalDist.overlap() or improve documentation?
type: enhancement
versions: Python 3.8

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue37905>
_______________________________________
