On Sun, Dec 29, 2019 at 4:05 PM Christopher Barker <[email protected]>
wrote:
>
>> You mean performance? Sure, but as I've argued before (no idea if anyone
> agrees with me) the statistics package is already not a high performance
> package anyway. If it turns out that it slows it down by, say, a factor of
> two or more, then yes, maybe we need to forget it.
>
You never know 'till you profile, so I did a quick experiment -- adding a
NaN filter is substantial overhead.
This is for a list of 10,000 random floats (no NaNs in there, but the check
is made anyway, by pre-filtering with a generator expression):
# this just calls statistics.median directly
In [14]: %timeit plainmedian(lots_of_floats)
1.54 ms ± 12.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# this filters with math.isnan()
In [15]: %timeit nanmedianfloat(lots_of_floats)
3.5 ms ± 176 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# this filters with a complex NAN-checker that works with most types and
values: floats, Decimals, numpy scalars, ...
In [16]: %timeit nanmedian(lots_of_floats)
13.5 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
So the simple math.isnan filter slows it down by a factor of a bit more
than two -- maybe tolerable -- and the full-featured is_nan checker by almost
a factor of ten -- that's pretty bad.
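For anyone without IPython handy, roughly the same comparison can be run
with the stdlib timeit module -- a minimal self-contained sketch (absolute
times will of course vary by machine):

```python
import math
import random
import statistics
import timeit

lots_of_floats = [random.random() for _ in range(10_000)]

def plainmedian(numbers):
    # wrapper just to equalize the function-call overhead
    return statistics.median(numbers)

def nanmedianfloat(numbers):
    # pre-filter with math.isnan -- only works on values math.isnan accepts
    return statistics.median(num for num in numbers if not math.isnan(num))

plain = timeit.timeit(lambda: plainmedian(lots_of_floats), number=100)
filtered = timeit.timeit(lambda: nanmedianfloat(lots_of_floats), number=100)
print(f"plain: {plain:.3f}s  filtered: {filtered:.3f}s  "
      f"ratio: {filtered / plain:.1f}x")
```

On clean data the two return the same result; the ratio printed at the end
is the slowdown from the filtering pass.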
I suspect that if the check were inlined, median could be a bit faster, and
I'm sure the NaN-checking code could be better optimized, but this is a
pretty big hit.
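One cheap optimization I haven't measured here (just a sketch): special-case
plain floats, which are by far the most common case, before falling back to
the general-purpose checks:

```python
import cmath
import math

def is_nan_fast(num):
    """NaN check with a fast path for plain floats."""
    if type(num) is float:
        # NaN is the only float value not equal to itself
        return num != num
    # Decimal (and anything else with an is_nan method)
    try:
        return num.is_nan()
    except AttributeError:
        pass
    if isinstance(num, complex):
        return cmath.isnan(num)
    try:
        return math.isnan(num)
    except TypeError:
        return False
```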
Note that numpy has a number of "nan*" functions, for nan-aware versions
that treat NaN as missing values (including nanquantile) -- we could take a
similar route, and have new names or a flag to disable or enable
nan-checking.
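To make the flag idea concrete, here's a hypothetical wrapper -- the
`omit_nan` name and signature are my invention, not a proposal for the
actual spelling:

```python
import math
import statistics

def median(data, *, omit_nan=False):
    """Hypothetical median with opt-in NaN filtering.

    With omit_nan=False (the default), behavior -- and performance --
    are unchanged; the filtering cost is only paid when asked for.
    """
    if omit_nan:
        data = (x for x in data
                if not (isinstance(x, float) and math.isnan(x)))
    return statistics.median(data)
```

So `median([1.0, float("nan"), 2.0, 3.0], omit_nan=True)` would skip the
NaN and return 2.0, while the default call keeps today's behavior.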
Code enclosed
- CHB
--
Christopher Barker, PhD
Python Language Consulting
- Teaching
- Scientific Software Development
- Desktop GUI and Web Development
- wxPython, numpy, scipy, Cython
# some tests of the impact of NaN-checking on statistics functions
import math
import cmath
import statistics
import random
# A few big lists for testing:
lots_of_floats = [random.random() for __ in range(10000)]
lots_with_nans = lots_of_floats[100:] + [float("NaN")] * 100
random.shuffle(lots_with_nans)
def is_nan(num):
    """
    This version works for everything I've tried
    """
    try:
        return num.is_nan()  # Decimal has an is_nan method
    except AttributeError:
        if isinstance(num, complex):
            return cmath.isnan(num)
        try:
            return math.isnan(num)
        except TypeError:  # a bare except would also swallow KeyboardInterrupt
            return False
def nanmedian(numbers):
    """
    a version of median that filters out NaN values
    """
    return statistics.median(num for num in numbers if not is_nan(num))
def nanmedianfloat(numbers):
    """
    a version of median that filters out NaN values -- but only for
    values that math.isnan works on
    """
    return statistics.median(num for num in numbers if not math.isnan(num))
def plainmedian(numbers):
    """
    just a wrapper to equalize the function call overhead
    """
    return statistics.median(numbers)
# a couple sanity checks:
ints = [1, 2, 3, 4, 5, 6]
assert nanmedian(ints) == statistics.median(ints)
floats = [1.0, 2.2, 3.3, 4.4, 5.5, 6.3]
assert nanmedian(floats) == statistics.median(floats)
floats_with_nan = floats[:] + [float("NaN")] * 3
random.shuffle(floats_with_nan)
assert nanmedian(floats_with_nan) == statistics.median(floats)
_______________________________________________
Python-ideas mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at
https://mail.python.org/archives/list/[email protected]/message/643XH5Q4CPM4TZOWHUWTOGLCJ7OHD5IW/
Code of Conduct: http://python.org/psf/codeofconduct/