On 26.08.2021 02:36, Finn Mason wrote:
> Perhaps a warning could be raised but the NaNs are ignored. For example:
>
> Input: statistics.mean([4, 2, float('nan')])
> Output: [warning blah blah blah]
> 3
>
> Or the NaNs could be treated as zeros and a warning raised:
>
> Input: statistics.mean([4, 2, float('nan')])
> Output: [warning blah blah blah]
> 2
>
> I do feel there should be a catchable warning but not an outright exception,
> and a non-NaN value should still be returned. This allows calculations to still
> quickly and easily be made with or without NaNs, but an alternative course of
> action can be taken in the presence of a NaN value if desired.
With the keyword argument, you can decide what to do.
As for the default: for codecs we made raising an exception
the default, simply because this highlights the need to make
an explicit decision.
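
As a rough sketch of what such a keyword argument could look like: the
wrapper function, the parameter name nan_policy and its three values are
made up for illustration and are not part of the statistics module. It
only mirrors the three behaviours from Steven's proposal, with raising
as the default.

```python
import math
import statistics


def _is_nan(x):
    # Only float NANs are detected here; Decimal NANs would need extra handling.
    return isinstance(x, float) and math.isnan(x)


def mean_with_nan_policy(data, *, nan_policy="raise"):
    """Hypothetical wrapper around statistics.mean(); not a stdlib API."""
    if nan_policy not in ("raise", "return", "ignore"):
        raise ValueError(f"unknown nan_policy: {nan_policy!r}")
    values = list(data)
    if any(_is_nan(x) for x in values):
        if nan_policy == "raise":
            raise ValueError("NAN found in data")
        if nan_policy == "return":
            return math.nan
        # nan_policy == "ignore": filter the NANs out before calculating
        values = [x for x in values if not _is_nan(x)]
    return statistics.mean(values)
```

With this, mean_with_nan_policy([4, 2, float('nan')], nan_policy="ignore")
gives 3, while leaving out the keyword argument raises a ValueError.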
For long running calculations this may not be desirable, but then
getting NAN as the end result isn't the best compromise either.
In practice it's better to check for NANs before entering a
calculation and then apply case-specific handling, e.g. replace
NANs with fixed default values, remove them, use a different
heuristic for the calculation, stop the calculation and ask
for better input, etc.
There are many ways to process things in the face of NANs.
In Python you can use a simple test for this:
>>> nan = float('nan')
>>> l = [1,2,3,nan]
>>> d = {nan:1, 2:3, 4:5, 5:nan}
>>> s = set(l)
>>> nan in l
True
>>> nan in d
True
>>> nan in s
True
but this really only makes sense for smaller data sets. If you
have a large data set where you rarely get NANs, using the
keyword argument may indeed be a better way to go about this.
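
One caveat with the containment test: it only finds the NAN because the
very same object is stored in the containers. Containment checks try
identity first and equality second, and a NAN never compares equal to
anything, so a NAN created elsewhere is not found. A minimal sketch of a
more robust check, where has_nan() is just an illustrative helper name:

```python
import math

nan = float('nan')
data = [1, 2, 3, nan]

# Found: the identical object is in the list (identity is checked before ==).
found_same_object = nan in data

# Not found: a freshly created NAN is a different object and nan != nan.
found_other_object = float('nan') in data


def has_nan(values):
    """Return True if any float in values is a NAN, regardless of identity."""
    return any(isinstance(x, float) and math.isnan(x) for x in values)


found_by_isnan = has_nan(data)
```

Here found_same_object and found_by_isnan are True, while
found_other_object is False.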
> In any case, the current behavior should definitely be changed.
Indeed. The NAN handling in median() looks like a bug, more than
anything else:
>>> import statistics
>>> statistics.mean(l)
nan
>>> statistics.mean(d)
nan
>>> statistics.mean(s)
nan
>>> l1 = [1,2,nan,4]
>>> statistics.mean(l1)
nan
>>> l2 = [nan,1,2,4]
>>> statistics.mean(l2)
nan
>>> statistics.median(l)
2.5
>>> statistics.median(l1)
nan
>>> statistics.median(l2)
1.5
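
My reading of why median() depends on the position of the NAN is that it
sorts the data first, and sorting is not well defined with NANs in the
data. The following is a hand-written sort-and-pick-the-middle median to
illustrate the effect; naive_median() is an illustration only, not the
actual code from the statistics module:

```python
import math

nan = float('nan')


def naive_median(data):
    """Sort the data and return the middle value (or mean of the middle two)."""
    values = sorted(data)
    n = len(values)
    if n == 0:
        raise ValueError("no median for empty data")
    mid = n // 2
    if n % 2:
        return values[mid]
    return (values[mid - 1] + values[mid]) / 2


# Every ordering comparison with a NAN is false ...
nan_less = nan < 1
nan_greater = 1 < nan

# ... so the result depends on where the NAN sits in the input.
m_end = naive_median([1, 2, 3, nan])
m_middle = naive_median([1, 2, nan, 4])
m_start = naive_median([nan, 1, 2, 4])
```

In CPython, sorted() leaves all three lists in their original order,
because no comparison ever reports an element as being out of place.
The NAN stays wherever it was in the input, giving 2.5, nan and 1.5,
the same values as in the session above.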
> On Tue, Aug 24, 2021, 1:46 AM Marc-Andre Lemburg wrote:
>
> On 24.08.2021 05:53, Steven D'Aprano wrote:
> > At the moment, the handling of NANs in the statistics module is
> > implementation dependent. In practice, that *usually* means that if your
> > data has a NAN in it, the result you get will probably be a NAN.
> >
> > >>> statistics.mean([1, 2, float('nan'), 4])
> > nan
> >
> > But there are unfortunate exceptions to this:
> >
> > >>> statistics.median([1, 2, float('nan'), 4])
> > nan
> > >>> statistics.median([float('nan'), 1, 2, 4])
> > 1.5
> >
> > I've spoken to users of other statistics packages and languages, such as
> > R, and I cannot find any consensus on what the "right" behaviour should
> > be for NANs except "not that!".
> >
> > So I propose that statistics functions gain a keyword only parameter to
> > specify the desired behaviour when a NAN is found:
> >
> > - raise an exception
> >
> > - return NAN
> >
> > - ignore it (filter out NANs)
> >
> > which seem to be the three most common preferences. (It seems to be
> > split roughly equally between the three.)
> >
> > Thoughts? Objections?
>
> Sounds good. This is similar to the errors argument we have
> for codecs where users can determine what the behavior should be
> in case of an error in processing.
>
> > Does anyone have any strong feelings about what should be the default?
>
> No strong preference, but if the objective is to continue calculations
> as much as possible even in the face of missing values, returning NAN
> is the better choice.
>
> Second best would be an exception, IMO, to signal: please be explicit
> about what to do about NANs in the calculation. It helps reduce the
> needed backtracking when the end result of a calculation
> turns out to be NAN.
>
> Filtering out NANs should always be an explicit choice to make.
> Ideally such filtering should happen *before* any calculations
> get applied. In some cases, it's better to replace NANs with
> use case specific default values. In others, removing them is the
> right thing to do.
>
> Note that e.g. SQL defaults to ignoring NULLs in aggregate functions
> such as AVG(), so there are standard precedents for ignoring NAN values
> per default as well. And yes, that default can lead to wrong results
> in reports which are hard to detect.
>
> --
> Marc-Andre Lemburg
> eGenix.com
--
Marc-Andre Lemburg
eGenix.com
Professional Python Services directly from the Experts (#1, Aug 26 2021)
>>> Python Projects, Coaching and Support ... https://www.egenix.com/
>>> Python Product Development ... https://consulting.egenix.com/
________________________________________________________________________
::: We implement business ideas - efficiently in both time and costs :::
eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
https://www.egenix.com/company/contact/
https://www.malemburg.com/
_______________________________________________
Python-ideas mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at
https://mail.python.org/archives/list/[email protected]/message/GKITVTHZ55SEP4NDHJR3G6KAGRBSXXEC/
Code of Conduct: http://python.org/psf/codeofconduct/