Re: [Python-Dev] PEP 450 adding statistics module

2013-08-16 Thread Oscar Benjamin
On 15 August 2013 14:08, Steven D'Aprano  wrote:
>
> - The API doesn't really feel very Pythonic to me. For example, we write:
>
> mystring.rjust(width)
> dict.items()
>
> rather than mystring.justify(width, "right") or dict.iterate("items"). So I
> think individual methods is a better API, and one which is more familiar to
> most Python users. The only innovation (if that's what it is) is to have
> median a callable object.

Although you're talking about median() above I think that this same
reasoning applies to the mode() signature. In the reference
implementation it has the signature:

def mode(data, max_modes=1):
    ...

The behaviour is that with the default max_modes=1 it will return the
unique mode or raise an error if there isn't a unique mode:

>>> mode([1, 2, 3, 3])
3
>>> mode([])
StatisticsError: no mode
>>> mode([1, 1, 2, 3, 3])
AssertionError

You can use the max_modes parameter to specify that more than one mode
is acceptable and setting max_modes to 0 or None returns all modes no
matter how many. In these cases mode() returns a list:

>>> mode([1, 1, 2, 3, 3], max_modes=2)
[1, 3]
>>> mode([1, 1, 2, 3, 3], max_modes=None)
[1, 3]

I can't think of a situation where 1 or 2 modes are acceptable but 3
is not. The only forms I can imagine using are mode(data) to get the
unique mode if it exists and mode(data, max_modes=None) to get the set
of all modes. But for that usage it would be better to have a boolean
flag and then either way you're at the point where it would normally
become two functions.

Also I dislike changing the return type based on special numeric values:
>>> mode([1, 2, 3, 3], max_modes=0)
[3]
>>> mode([1, 2, 3, 3], max_modes=1)
3
>>> mode([1, 2, 3, 3], max_modes=2)
[3]
>>> mode([1, 2, 3, 3], max_modes=3)
[3]

My preference would be to have two functions, one called e.g. modes()
and one called mode(). modes() always returns a list of the most
frequent values no matter how many. mode() returns a unique mode if
there is one or raises an error. I think that that would be simpler to
document and easier to learn and use. If the user is for whatever
reason happy with 1 or 2 modes but not 3 then they can call modes()
and check for themselves.

Also I think that:
>>> modes([])
[]
but I expect others to disagree.
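
A minimal sketch of the two-function split described above. This is not the reference implementation: the bodies are assumptions built on collections.Counter, and StatisticsError is defined locally as a stand-in for the error type shown in the examples.

```python
from collections import Counter


class StatisticsError(ValueError):
    """Stand-in for the error type named in the thread (assumed definition)."""


def modes(data):
    """Return a list of the most frequent values, however many there are."""
    counts = Counter(data)
    if not counts:
        # Assumption: empty data has no modes, so return an empty list.
        return []
    highest = max(counts.values())
    # Counter keeps first-seen order, so ties come back in input order.
    return [value for value, count in counts.items() if count == highest]


def mode(data):
    """Return the unique mode, or raise if there is not exactly one."""
    found = modes(data)
    if not found:
        raise StatisticsError("no mode")
    if len(found) > 1:
        raise StatisticsError("no unique mode")
    return found[0]
```

With this split the return type never depends on an argument: modes() always gives a list and mode() always gives a single value or raises. A caller who accepts one or two modes but not three can check len(modes(data)) directly.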


Oscar
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 450 adding statistics module

2013-08-16 Thread Antoine Pitrou
On Fri, 16 Aug 2013 12:44:54 +1000
Steven D'Aprano  wrote:

> On 16/08/13 04:10, Eric V. Smith wrote:
> 
> > I agree with Mark: the proposed median, median.low, etc., doesn't feel
> > right. Is there any example of doing this in the stdlib?
> 
> The most obvious case is datetime: we have datetime(), and datetime.now(), 
> datetime.today(), and datetime.strftime(). The only API difference between it 
> and median is that datetime is a type and median is not, but that's a 
> difference that makes no difference:

Of course it does. The datetime classmethods return datetime instances,
which is why it makes sense to have them classmethods (as opposed to
module functions).

The median functions, however, don't return median instances.

> My preference is to make median a singleton instance with a __call__ method, 
> and the other flavours regular methods. Although I don't like polluting the 
> global namespace with an unnecessary class that will only be instantiated 
> once, if it helps I can do this:
> 
> class _Median:
>  def __call__(self, data): ...
>  def low(self, data): ...
> 
> median = _Median()
> 
> If that standard OOP design is unacceptable, I will swap the dots for 
> underscores, but I won't like it.

Using "OOP design" for something which is conceptually not OO
(you are just providing callables in the end, not types and objects:
your _Median "type" doesn't carry any state) is not really standard in
Python. It would be in Java :-)
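
To make the comparison concrete, here is a rough sketch of both spellings. The function bodies are assumptions for illustration only (not the PEP's code), and median_obj is a hypothetical name chosen to avoid shadowing the plain function.

```python
def median(data):
    """Median that averages the two middle values for even-length data."""
    values = sorted(data)
    n = len(values)
    if n == 0:
        raise ValueError("no median for empty data")
    mid = n // 2
    if n % 2 == 1:
        return values[mid]
    return (values[mid - 1] + values[mid]) / 2


def median_low(data):
    """Median that picks the lower of the two middle values."""
    values = sorted(data)
    if not values:
        raise ValueError("no median for empty data")
    return values[(len(values) - 1) // 2]


class _Median:
    """Callable-singleton spelling: the methods never read or write self."""

    def __call__(self, data):
        return median(data)

    def low(self, data):
        return median_low(data)


median_obj = _Median()

# The instance has no attributes of its own, i.e. it carries no state.
print(vars(median_obj))
```

Both spellings compute the same results; the class only changes how the callables are named and grouped.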

Regards

Antoine.




Re: [Python-Dev] PEP 450 adding statistics module

2013-08-16 Thread Mark Shannon



On 15/08/13 14:08, Steven D'Aprano wrote:
> On 15/08/13 21:42, Mark Dickinson wrote:
>> The PEP and code look generally good to me.
>>
>> I think the API for median and its variants deserves some wider discussion:
>> the reference implementation has a callable 'median', and variant callables
>> 'median.low', 'median.high', 'median.grouped'.  The pattern of attaching
>> the variant callables as attributes on the main callable is unusual, and
>> isn't something I've seen elsewhere in the standard library.  I'd like to
>> see some explanation in the PEP for why it's done this way.  (There was
>> already some discussion of this on the issue, but that was more centered
>> around the implementation than the API.)
>>
>> I'd propose two alternatives for this:  either have separate functions
>> 'median', 'median_low', 'median_high', etc., or have a single function
>> 'median' with a "method" argument that takes a string specifying
>> computation using a particular method.  I don't see a really good reason to
>> deviate from standard patterns here, and fear that users would find the
>> current API surprising.
>
> Alexander Belopolsky has convinced me (off-list) that my current
> implementation is better changed to a more conservative one of a callable
> singleton instance with methods implementing the alternative computations.
> I'll have something like:
>
> def _singleton(cls):
>     return cls()
>
>
> @_singleton
> class median:
>     def __call__(self, data):
>         ...
>     def low(self, data):
>         ...
>     ...

Horrible.



> In my earlier stats module, I had a single median function that took an
> argument to choose between alternatives. I called it "scheme":
>
> median(data, scheme="low")


What is wrong with this?
It's a perfect API; simple and self-explanatory.
median is a function in the mathematical sense and it should be a function in 
Python.
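
For illustration, a single function with a scheme argument could look roughly like this. Only the call median(data, scheme="low") comes from the thread; the scheme names "mean" and "high", the default, and the body are assumptions.

```python
def median(data, scheme="mean"):
    """Return the median of data, computed according to `scheme`.

    Scheme names other than "low" are hypothetical, chosen for this sketch.
    """
    values = sorted(data)
    n = len(values)
    if n == 0:
        raise ValueError("median of empty data")
    # For odd n both indexes point at the same middle element.
    low = values[(n - 1) // 2]
    high = values[n // 2]
    if scheme == "low":
        return low
    if scheme == "high":
        return high
    if scheme == "mean":
        return low if n % 2 else (low + high) / 2
    raise ValueError(f"unknown scheme: {scheme!r}")
```

The trade-off is the one raised elsewhere in the thread: the variants stay out of the module namespace, but a misspelled scheme is only caught at call time.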



> R uses a parameter called "type" to choose between alternate calculations,
> not for median as we are discussing, but for quantiles:
>
> quantile(x, probs ... type = 7, ...).
>
> SAS also uses a similar system, but with different numeric codes. I rejected
> both "type" and "method" as the parameter name since it would cause
> confusion with the usual meanings of those words. I eventually decided
> against this system for two reasons:


There are other words to choose from ;) "scheme" seems OK to me.



> - Each scheme ended up needing to be a separate function, for ease of both
> implementation and testing. So I had four private median functions, which I
> put inside a class to act as namespace and avoid polluting the main
> namespace. Then I needed a "master function" to select which of the methods
> should be called, with all the additional testing and documentation that
> entailed.
>
> - The API doesn't really feel very Pythonic to me. For example, we write:
>
> mystring.rjust(width)
> dict.items()

These are methods on objects; the result of these calls depends on the value of
the 'self' argument, not merely its class. Not so with a median singleton.

We also have len(seq) and copy.copy(obj).
No classes required.



> rather than mystring.justify(width, "right") or dict.iterate("items"). So I
> think individual methods is a better API, and one which is more familiar to
> most Python users. The only innovation (if that's what it is) is to have
> median a callable object.
>
> As far as having four separate functions, median, median_low, etc., it just
> doesn't feel right to me. It puts four slight variations of the same
> function into the main namespace, instead of keeping them together in a
> namespace. Names like median_low merely simulate a namespace with
> pseudo-methods separated with underscores instead of dots, only without the
> advantages of a real namespace.
>
> (I treat variance and std dev differently, and make the sample and
> population forms separate top-level functions rather than methods, simply
> because they are so well-known from scientific calculators that it is
> unthinkable to me to do differently. Whenever I use numpy, I am surprised
> all over again that it has only a single variance function.)






Re: [Python-Dev] PEP 450 adding statistics module

2013-08-16 Thread Steven D'Aprano

On 16/08/13 17:47, Oscar Benjamin wrote:
> I can't think of a situation where 1 or 2 modes are acceptable but 3
> is not. The only forms I can imagine using are mode(data) to get the
> unique mode if it exists and mode(data, max_modes=None) to get the set
> of all modes.


Hmmm, I think you are right. The current design is leftover from when mode also 
supported continuous data, and it made more sense there.



> But for that usage it would be better to have a boolean
> flag and then either way you're at the point where it would normally
> become two functions.


Alright, you've convinced me. I'll provide two functions: mode, which returns 
the single value with the highest frequency, or raises; and a second function, 
which collates the data into a sorted (value, frequency) list. Bike-shedding on 
the name of this second function is welcomed :-)



--
Steven


Re: [Python-Dev] PEP 450 adding statistics module

2013-08-16 Thread Oscar Benjamin
On Aug 16, 2013 11:05 AM, "Steven D'Aprano" wrote:
>
> I'll provide two functions: mode, which returns the single value with the
> highest frequency, or raises; and a second function, which collates the
> data into a sorted (value, frequency) list. Bike-shedding on the name of
> this second function is welcomed :-)

I'd call it counts() and prefer an OrderedDict for easy lookup. By that
point you're very close to Counter though (which it currently uses
internally).
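
A rough sketch of what such a counts() might look like. The ordering (most frequent first) is an assumption, since "sorted" above could equally mean sorted by value, and the body is illustrative rather than anything from the reference implementation.

```python
from collections import Counter, OrderedDict


def counts(data):
    """Map each value to its frequency, most frequent first (assumed order)."""
    # Counter.most_common() returns (value, frequency) pairs sorted by
    # descending frequency; OrderedDict keeps that order and allows lookup.
    return OrderedDict(Counter(data).most_common())


table = counts([1, 1, 2, 3, 3, 3])
print(list(table.items()))  # [(3, 3), (1, 2), (2, 1)]
print(table[1])             # 2
```

As the one-line body suggests, this adds very little over calling Counter(data).most_common() directly.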

Oscar


[Python-Dev] Summary of Python tracker Issues

2013-08-16 Thread Python tracker

ACTIVITY SUMMARY (2013-08-09 - 2013-08-16)
Python tracker at http://bugs.python.org/

To view or respond to any of the issues listed below, click on the issue.
Do NOT respond to this message.

Issues counts and deltas:
  open    4152 ( +4)
  closed 26377 (+56)
  total  30529 (+60)

Open issues with patches: 1896 


Issues opened (51)
==================

#17477: update the bsddb module do build with db 5.x versions
http://bugs.python.org/issue17477  reopened by jcea

#18693: help() not helpful with enum
http://bugs.python.org/issue18693  reopened by ethan.furman

#18697: Unify arguments names in Unicode object C API documentation
http://bugs.python.org/issue18697  opened by serhiy.storchaka

#18699: What is Future.running() for in PEP 3148 / concurrent.futures.
http://bugs.python.org/issue18699  opened by gvanrossum

#18701: Remove outdated PY_VERSION_HEX checks
http://bugs.python.org/issue18701  opened by serhiy.storchaka

#18702: Report skipped tests as skipped
http://bugs.python.org/issue18702  opened by serhiy.storchaka

#18703: To change the doc of html/faq/gui.html
http://bugs.python.org/issue18703  opened by madan.ram

#18704: IDLE: PEP8 Style Check Integration
http://bugs.python.org/issue18704  opened by JayKrish

#18705: Fix typos/spelling mistakes in Lib/*.py files
http://bugs.python.org/issue18705  opened by iwontbecreative

#18706: test failure in test_codeccallbacks
http://bugs.python.org/issue18706  opened by pitrou

#18707: the readme should also talk about how to build doc.
http://bugs.python.org/issue18707  opened by madan.ram

#18709: SSL module fails to handle NULL bytes inside subjectAltNames g
http://bugs.python.org/issue18709  opened by christian.heimes

#18710: Add PyState_GetModuleAttr
http://bugs.python.org/issue18710  opened by pitrou

#18711: Add PyErr_FormatV
http://bugs.python.org/issue18711  opened by pitrou

#18712: Pure Python operator.index doesn't match the C version.
http://bugs.python.org/issue18712  opened by mark.dickinson

#18713: Enable surrogateescape on stdin and stdout when appropriate
http://bugs.python.org/issue18713  opened by ncoghlan

#18714: Add tests for pdb.find_function
http://bugs.python.org/issue18714  opened by kevinjqiu

#18715: Tests fail when run with coverage
http://bugs.python.org/issue18715  opened by seydou

#18716: Deprecate the formatter module
http://bugs.python.org/issue18716  opened by brett.cannon

#18717: test for request.urlretrieve
http://bugs.python.org/issue18717  opened by mjehanzeb

#18718: datetime documentation contradictory on leap second support
http://bugs.python.org/issue18718  opened by wolever

#18720: Switch suitable constants in the socket module to IntEnum
http://bugs.python.org/issue18720  opened by eli.bendersky

#18723: shorten function of textwrap module is susceptible to non-norm
http://bugs.python.org/issue18723  opened by vajrasky

#18725: Multiline shortening
http://bugs.python.org/issue18725  opened by serhiy.storchaka

#18726: json functions have too many positional parameters
http://bugs.python.org/issue18726  opened by serhiy.storchaka

#18727: test for writing dictionary rows to CSV
http://bugs.python.org/issue18727  opened by mjehanzeb

#18728: Increased test coverage for filecmp.py
http://bugs.python.org/issue18728  opened by Alex.Volkov

#18729: In unittest.TestLoader.discover doc select the name of load_te
http://bugs.python.org/issue18729  opened by py.user

#18730: suffix parameter in NamedTemporaryFile silently fails when not
http://bugs.python.org/issue18730  opened by dloewenherz

#18731: Increased test coverage for uu and telnet
http://bugs.python.org/issue18731  opened by Alex.Volkov

#18733: elementtree: stop the parser more quickly on error
http://bugs.python.org/issue18733  opened by haypo

#18734: Berkeley DB versions 4.4-4.9 are not discovered by setup.py
http://bugs.python.org/issue18734  opened by Eddie.Stanley

#18736: Invalid charset in HTML pages inside documentation in CHM form
http://bugs.python.org/issue18736  opened by grv87

#18737: Get virtual subclasses of an ABC
http://bugs.python.org/issue18737  opened by christian.heimes

#18738: String formatting (% and str.format) issues with Enum
http://bugs.python.org/issue18738  opened by ethan.furman

#18739: math.log of a long returns a different value of math.log of an
http://bugs.python.org/issue18739  opened by gregory.p.smith

#18741: Fix typos/spelling mistakes in Lib/*/*/.py files
http://bugs.python.org/issue18741  opened by iwontbecreative

#18742: Abstract base class for hashlib
http://bugs.python.org/issue18742  opened by christian.heimes

#18743: References to non-existant "StringIO" module
http://bugs.python.org/issue18743  opened by jcea

#18744: pathological performance using tarfile
http://bugs.python.org/issue18744  opened by teamnoir

#18745: Test enum in test_json is ignorant of infinity value
http://bugs.python.org/issue18745  opened by vajrasky

#18746: test_threading.test_finalize_with_trace() fails on FreeBSD bui
ht