On Friday, August 16, 2013 10:15:52 AM UTC-7, Oscar Benjamin wrote:
> On 16 August 2013 17:31,  <chris.bar...@noaa.gov> wrote:
> Although it doesn't mention this in the PEP, a significant point that
> is worth bearing in mind is that numpy is only for CPython, not PyPy,
> IronPython, Jython etc. See here for a recent update on the status of
It does mention it, though I think not the other implementations by name. 
And yes, the lack of numpy on the other implementations is a major limitation.
 
> > "crunching numbers in python without numpy is like doing text processing 
> > without using the string object"
> 
> It depends what kind of number crunching you're doing.

As it depends on what kind of text processing you're doing... You could go a 
long way with a pure-Python sequence-of-abstract-characters library, but it 
would be painfully slow -- no one would even try.

I guess there are more people working with, say, hundreds of numbers, than 
people trying to process an equally tiny amount of text...but this is a 
digression.

My point is that you can only reasonably do string processing with Python 
because Python has the concept of a string -- not just an arbitrary sequence 
of characters -- and not just for speed's sake, but for the nice semantics.

Anyone who has used an array-oriented language or library is likely to get 
addicted to the idea that an array of numbers as a first-class concept is 
really, really helpful, for both performance and semantics.

> Numpy gives efficient C-style number crunching

which is by far the most common case. Also, a properly designed algorithm may 
well need to know something about the internal storage/processing of the data 
type -- i.e. the best way to compute a given statistic for floating point may 
not be the same as for integers (or decimals, or ...). Maybe you can find a 
good one that works for most, but...


> You can use  dtype=object to use all these things with numpy arrays but in my 
> experience this is typically not faster than working with Python lists

That's quite true. In fact, often slower.

> and is only really useful when you want numpy's multi-dimensional, 
> view-type slicing.

which is very useful indeed!

> Here's an example where Steven's statistics module is more accurate:

>     >>> numpy.mean([-1e60, 100, 100, 1e60])
>     0.0
>     >>> statistics.mean([-1e60, 100, 100, 1e60])
>     50.0

the wonders of floating point arithmetic! But this looks like more of an 
argument for a better algorithm in numpy than a reason to have something in 
the stdlib. In fact, that's been discussed lately -- there is talk of using 
compensated summation in numpy's sum() method -- not sure of the status.
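For reference, compensated (Kahan) summation -- the technique mentioned above -- can be sketched in a few lines of pure Python; this is the textbook algorithm, not whatever numpy may or may not adopt:

```python
def kahan_sum(values):
    """Kahan (compensated) summation: carry a running correction term
    so the low-order bits lost in each addition are fed back in."""
    total = 0.0
    c = 0.0  # running compensation for lost low-order bits
    for x in values:
        y = x - c            # fold the previous error into the next term
        t = total + y        # big + small: low-order bits of y are lost...
        c = (t - total) - y  # ...and algebraically recovered here
        total = t
    return total
```

For example, sum([1e16, 1.0, 1.0, -1e16]) gives 0.0 (each 1.0 falls below the rounding granularity at 1e16), while kahan_sum() of the same list gives the correct 2.0.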

> Okay so that's a toy example but it illustrates that Steven is aiming 
> for ultra-high accuracy where numpy is primarily aimed at speed. 

well, yes, for the most part numpy does trade accuracy for speed when it has 
to -- but that's not the case here; I think this is a case of "no one took 
the time to write a better algorithm"

> He's also tried to ensure that it works properly with e.g. fractions:

That is pretty cool, yes.
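Assuming the reference implementation that became Python 3.4's statistics module, the Fraction support looks like this -- the internal summation is exact, so Fraction inputs come back as an exact Fraction rather than a rounded float:

```python
from fractions import Fraction
import statistics  # stdlib as of Python 3.4 (PEP 450)

# 1/2 + 1/3 + 1/6 == 1 exactly, so the mean is exactly 1/3 --
# no intermediate conversion to float, no rounding.
result = statistics.mean([Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)])
```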

> > What this is really an argument for is a numpy-lite in the standard 
> > library, which could be used to build these sorts of things on. But that's 
> > been rejected before...
> 
> If it's a numpy-lite then it's a numpy-ultra-lite. It really doesn't
> provide much of what numpy provides.

I wasn't clear -- my point was that things like this should be built on a 
numpy-like array object (a numpy-lite) -- so first add such an object to the 
stdlib, then build this on top of it. But a key problem with that is where 
you draw the line that defines numpy-lite. I'd say just the core storage 
object, but then someone wants to add statistics, and someone else wants to 
add polynomials, and then random numbers, then ... and pretty soon you've got 
numpy again!

> > All that being said -- if you do decide to do this, please use a PEP 3118 
> > (enhanced buffer) supporting data type (probably array.array) -- 
> > compatibility with numpy and other packages for crunching numbers is very 
> > nice.
> 
> > If someone decides to build a stand-alone stats package -- building it on a 
> > ndarray-lite (PEP 3118 compatible) object would be a nice way to go.
> 
> Why? Yes I'd also like an ndarray-lite or rather an ultra-lite 
> 1-dimensional version but why would it be useful for the statistics 
> module over using standard Python containers? Note that numpy arrays 
> do work with the reference implementation of the statistics module 
> (they're just treated as iterables):

One of the really great things about numpy is that when you work with a LOT of 
numbers (which is not rare in this era of Big Data) it stores them efficiently, 
and you can pass them around between different arrays and other libraries 
without unpacking and copying data. That's what PEP 3118 is all about.
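You can see the PEP 3118 mechanism with nothing but the stdlib -- this is a small illustrative sketch, using memoryview over array.array; numpy shares memory with such objects through the same buffer protocol:

```python
from array import array

# Six doubles stored compactly as a C double[] (8 bytes each),
# not as six boxed Python float objects.
a = array('d', range(6))

# A memoryview shares the same memory through the PEP 3118 buffer
# protocol -- no copying.  Here we also reshape the view to 2x3.
# (numpy.frombuffer(a) would wrap the same memory the same way.)
m = memoryview(a).cast('B').cast('d', shape=[2, 3])

m[1, 2] = 99.0  # writes through the view: a[5] is now 99.0 too,
                # because both names refer to one block of memory.
```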

It looks like there is some real care being put into these algorithms, so it 
would be nice if they could be efficiently used for large data sets and with 
numpy.


>     >>> import numpy
>     >>> import statistics 
>     >>> statistics.mean(numpy.array([1, 2, 3]))

you'll probably find that this is slower than a python list -- numpy has some 
overhead when used as a generic sequence.

> > One other point -- for performance reasons, it would be nice to have some 
> > compiled code in there -- this adds incentive to put it in the stdlib -- 
> > external packages that need compiling are what makes numpy unacceptable to 
> > some folks.
>  
> It might be good to have a C accelerator one day but actually I think 
> the pure-Python-ness of it is a strong reason to have it since it 
> provides accurate statistics functions to all Python implementations 
> (unlike numpy) at no additional cost.

Well, I'd rather not have a package that is great for education and toy 
problems, but not so good for real ones...

I guess my point is this:

This is a way to make the standard Python distribution better for some common 
computational tasks. But rather than thinking of it as "we need some stats 
functions in the Python stdlib", perhaps we should be thinking: "out of the 
box, Python should be better for computation" -- in which case, I'd start 
with a decent array object.

-Chris

-- 
http://mail.python.org/mailman/listinfo/python-list