On Sun, Dec 29, 2019 at 4:05 PM Christopher Barker <[email protected]>
wrote:
>
>> You mean performance? Sure, but as I've argued before (no idea if anyone
> agrees with me) the statistics package is already not a high performance
> package anyway. If it turns out that it slows it down by, say, a factor of
> two or more, then yes, maybe we need to forget it.
>
You never know 'till you profile, so I did a quick experiment -- adding a
NaN filter is substantial overhead.
This is for a list of 10,000 random floats (no NaNs in there, but the check
is made anyway, by pre-filtering with a generator expression):
# this just calls statistics.median directly
In [14]: %timeit plainmedian(lots_of_floats)
1.54 ms ± 12.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# this filters with math.isnan()
In [15]: %timeit nanmedianfloat(lots_of_floats)
3.5 ms ± 176 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# this filters with a complex NAN-checker that works with most types and
values: floats, Decimals, numpy scalars, ...
In [16]: %timeit nanmedian(lots_of_floats)
13.5 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
So the simple math.isnan filter slows it down by a factor of a bit more
than two -- maybe tolerable -- and the full-featured is_nan checker by almost
a factor of ten -- that's pretty bad.
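For anyone without IPython handy, roughly the same comparison can be run
with the stdlib timeit module -- a minimal self-contained sketch (absolute
times will of course vary by machine):

```python
import math
import random
import statistics
import timeit

lots_of_floats = [random.random() for _ in range(10_000)]

def plainmedian(numbers):
    # wrapper just to equalize the function-call overhead
    return statistics.median(numbers)

def nanmedianfloat(numbers):
    # pre-filter with math.isnan -- only works on values math.isnan accepts
    return statistics.median(num for num in numbers if not math.isnan(num))

plain = timeit.timeit(lambda: plainmedian(lots_of_floats), number=100)
filtered = timeit.timeit(lambda: nanmedianfloat(lots_of_floats), number=100)
print(f"plain: {plain:.3f}s  filtered: {filtered:.3f}s  "
      f"ratio: {filtered / plain:.1f}x")
```

On clean data the two return the same result; the ratio printed at the end
is the slowdown from the filtering pass.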
I suspect that if the check were inlined, median could be a bit faster, and
I'm sure the NaN-checking code could be better optimized, but this is a
pretty big hit.
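One cheap optimization I haven't measured here (just a sketch): special-case
plain floats, which are by far the most common case, before falling back to
the general-purpose checks:

```python
import cmath
import math

def is_nan_fast(num):
    """NaN check with a fast path for plain floats."""
    if type(num) is float:
        # NaN is the only float value not equal to itself
        return num != num
    # Decimal (and anything else with an is_nan method)
    try:
        return num.is_nan()
    except AttributeError:
        pass
    if isinstance(num, complex):
        return cmath.isnan(num)
    try:
        return math.isnan(num)
    except TypeError:
        return False
```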
Note that numpy has a number of "nan*" functions, for nan-aware versions
that treat NaN as missing values (including nanquantile) -- we could take a
similar route, and have new names or a flag to disable or enable
nan-checking.
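To make the flag idea concrete, here's a hypothetical wrapper -- the
`omit_nan` name and signature are my invention, not a proposal for the
actual spelling:

```python
import math
import statistics

def median(data, *, omit_nan=False):
    """Hypothetical median with opt-in NaN filtering.

    With omit_nan=False (the default), behavior -- and performance --
    are unchanged; the filtering cost is only paid when asked for.
    """
    if omit_nan:
        data = (x for x in data
                if not (isinstance(x, float) and math.isnan(x)))
    return statistics.median(data)
```

So `median([1.0, float("nan"), 2.0, 3.0], omit_nan=True)` would skip the
NaN and return 2.0, while the default call keeps today's behavior.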
Code enclosed
- CHB
--
Christopher Barker, PhD
Python Language Consulting
- Teaching
- Scientific Software Development
- Desktop GUI and Web Development
- wxPython, numpy, scipy, Cython
# some tests of the impact of NaN-checking on statistics functions
import math
import cmath
import statistics
import random
# A few big lists for testing:
lots_of_floats = [random.random() for __ in range(10000)]
lots_with_nans = lots_of_floats[100:] + [float("NaN")] * 100
random.shuffle(lots_with_nans)
def is_nan(num):
    """
    This version works for everything I've tried
    """
    try:
        return num.is_nan()  # Decimal has an is_nan method
    except AttributeError:
        if isinstance(num, complex):
            return cmath.isnan(num)
        try:
            return math.isnan(num)
        except TypeError:  # a bare except would also swallow KeyboardInterrupt
            return False
def nanmedian(numbers):
    """
    a version of median that filters out NaN values
    """
    return statistics.median(num for num in numbers if not is_nan(num))
def nanmedianfloat(numbers):
    """
    a version of median that filters out NaN values -- but only for
    values that math.isnan works on
    """
    return statistics.median(num for num in numbers if not math.isnan(num))
def plainmedian(numbers):
    """
    just a wrapper to equalize the function call overhead
    """
    return statistics.median(numbers)
# a couple sanity checks:
ints = [1, 2, 3, 4, 5, 6]
assert nanmedian(ints) == statistics.median(ints)
floats = [1.0, 2.2, 3.3, 4.4, 5.5, 6.3]
assert nanmedian(floats) == statistics.median(floats)
floats_with_nan = floats[:] + [float("NaN")] * 3
random.shuffle(floats_with_nan)
assert nanmedian(floats_with_nan) == statistics.median(floats)
_______________________________________________
Python-ideas mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at
https://mail.python.org/archives/list/[email protected]/message/643XH5Q4CPM4TZOWHUWTOGLCJ7OHD5IW/
Code of Conduct: http://python.org/psf/codeofconduct/