Re: [Python-Dev] PEP 450 adding statistics module
On 15 August 2013 14:08, Steven D'Aprano wrote:
>
> - The API doesn't really feel very Pythonic to me. For example, we write:
>
> mystring.rjust(width)
> dict.items()
>
> rather than mystring.justify(width, "right") or dict.iterate("items"). So I
> think individual methods is a better API, and one which is more familiar to
> most Python users. The only innovation (if that's what it is) is to have
> median a callable object.
Although you're talking about median() above I think that this same
reasoning applies to the mode() signature. In the reference
implementation it has the signature:
    def mode(data, max_modes=1):
        ...
The behaviour is that with the default max_modes=1 it will return the
unique mode or raise an error if there isn't a unique mode:
>>> mode([1, 2, 3, 3])
3
>>> mode([])
StatisticsError: no mode
>>> mode([1, 1, 2, 3, 3])
AssertionError
You can use the max_modes parameter to specify that more than one mode
is acceptable and setting max_modes to 0 or None returns all modes no
matter how many. In these cases mode() returns a list:
>>> mode([1, 1, 2, 3, 3], max_modes=2)
[1, 3]
>>> mode([1, 1, 2, 3, 3], max_modes=None)
[1, 3]
I can't think of a situation where 1 or 2 modes are acceptable but 3
is not. The only forms I can imagine using are mode(data) to get the
unique mode if it exists and mode(data, max_modes=None) to get the set
of all modes. But for that usage it would be better to have a boolean
flag and then either way you're at the point where it would normally
become two functions.
Also I dislike changing the return type based on special numeric values:
>>> mode([1, 2, 3, 3], max_modes=0)
[3]
>>> mode([1, 2, 3, 3], max_modes=1)
3
>>> mode([1, 2, 3, 3], max_modes=2)
[3]
>>> mode([1, 2, 3, 3], max_modes=3)
[3]
My preference would be to have two functions, one called e.g. modes()
and one called mode(). modes() always returns a list of the most
frequent values no matter how many. mode() returns a unique mode if
there is one or raises an error. I think that that would be simpler to
document and easier to learn and use. If the user is for whatever
reason happy with 1 or 2 modes but not 3 then they can call modes()
and check for themselves.
Also I think that:
>>> modes([])
[]
but I expect others to disagree.
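A minimal sketch of how the two-function split could look, assuming collections.Counter does the tallying. The StatisticsError class here is a local stand-in, and returning [] for empty data is only the preference stated above, not settled behaviour:

```python
from collections import Counter


class StatisticsError(ValueError):
    """Local stand-in for the error type raised by the reference implementation."""


def modes(data):
    """Return a list of the most frequent values, however many there are."""
    counts = Counter(data)
    if not counts:
        return []  # assumed behaviour for empty data
    top = max(counts.values())
    # Counter keeps first-seen order, so tied values come back in that order.
    return [value for value, n in counts.items() if n == top]


def mode(data):
    """Return the unique mode, or raise if there is not exactly one."""
    found = modes(data)
    if len(found) != 1:
        raise StatisticsError("no unique mode")
    return found[0]
```

A caller who is happy with one or two modes but not three can call modes(data) and check len() on the result.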
Oscar
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe:
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 450 adding statistics module
On Fri, 16 Aug 2013 12:44:54 +1000
Steven D'Aprano wrote:
> On 16/08/13 04:10, Eric V. Smith wrote:
> >
> > I agree with Mark: the proposed median, median.low, etc., doesn't feel
> > right. Is there any example of doing this in the stdlib?
>
> The most obvious case is datetime: we have datetime(), and datetime.now(),
> datetime.today(), and datetime.strftime(). The only API difference between it
> and median is that datetime is a type and median is not, but that's a
> difference that makes no difference:

Of course it does. The datetime classmethods return datetime instances,
which is why it makes sense to have them classmethods (as opposed to
module functions). The median functions, however, don't return median
instances.

> My preference is to make median a singleton instance with a __call__ method,
> and the other flavours regular methods. Although I don't like polluting the
> global namespace with an unnecessary class that will only be instantiated
> once, if it helps I can do this:
>
> class _Median:
>     def __call__(self, data): ...
>     def low(self, data): ...
>
> median = _Median()
>
> If that standard OOP design is unacceptable, I will swap the dots for
> underscores, but I won't like it.

Using "OOP design" for something which is conceptually not OO (you are
just providing callables in the end, not types and objects: your _Median
"type" doesn't carry any state) is not really standard in Python.
It would be in Java :-)

Regards

Antoine.
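For comparison, here is a rough sketch of the stateless alternative: the same variants written as plain module-level functions with underscore names. The function bodies and the shared helper are illustrative assumptions, not the reference implementation:

```python
def _sorted_nonempty(data):
    """Hypothetical helper: sort the data and reject empty input."""
    ordered = sorted(data)
    if not ordered:
        raise ValueError("no median for empty data")
    return ordered


def median(data):
    """Middle value, averaging the two middle values for even-length data."""
    ordered = _sorted_nonempty(data)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def median_low(data):
    """Lower of the two middle values when the length is even."""
    ordered = _sorted_nonempty(data)
    return ordered[(len(ordered) - 1) // 2]


def median_high(data):
    """Higher of the two middle values when the length is even."""
    ordered = _sorted_nonempty(data)
    return ordered[len(ordered) // 2]
```

None of these functions needs an instance or any state, which is the point being made about the _Median class.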
Re: [Python-Dev] PEP 450 adding statistics module
On 15/08/13 14:08, Steven D'Aprano wrote:
> On 15/08/13 21:42, Mark Dickinson wrote:
> > The PEP and code look generally good to me.
> >
> > I think the API for median and its variants deserves some wider discussion:
> > the reference implementation has a callable 'median', and variant callables
> > 'median.low', 'median.high', 'median.grouped'. The pattern of attaching
> > the variant callables as attributes on the main callable is unusual, and
> > isn't something I've seen elsewhere in the standard library. I'd like to
> > see some explanation in the PEP for why it's done this way. (There was
> > already some discussion of this on the issue, but that was more centered
> > around the implementation than the API.)
> >
> > I'd propose two alternatives for this: either have separate functions
> > 'median', 'median_low', 'median_high', etc., or have a single function
> > 'median' with a "method" argument that takes a string specifying
> > computation using a particular method. I don't see a really good reason to
> > deviate from standard patterns here, and fear that users would find the
> > current API surprising.
>
> Alexander Belopolsky has convinced me (off-list) that my current
> implementation is better changed to a more conservative one of a callable
> singleton instance with methods implementing the alternative computations.
> I'll have something like:
>
> def _singleton(cls):
>     return cls()
>
> @_singleton
> class median:
>     def __call__(self, data):
>         ...
>     def low(self, data):
>         ...
>     ...

Horrible.

> In my earlier stats module, I had a single median function that took an
> argument to choose between alternatives. I called it "scheme":
>
> median(data, scheme="low")

What is wrong with this?
It's a perfect API; simple and self-explanatory.
median is a function in the mathematical sense and it should be a function in
Python.
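
A hedged sketch of what such a single function might look like. Only scheme="low" appears in the quoted text; the "mean" and "high" scheme names, the default, and the private helpers are assumptions made for illustration:

```python
def _pick_mean(ordered):
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def _pick_low(ordered):
    return ordered[(len(ordered) - 1) // 2]


def _pick_high(ordered):
    return ordered[len(ordered) // 2]


# Hypothetical scheme names; only "low" is taken from the discussion.
_SCHEMES = {"mean": _pick_mean, "low": _pick_low, "high": _pick_high}


def median(data, scheme="mean"):
    """Return the median of data, computed by the named scheme."""
    try:
        pick = _SCHEMES[scheme]
    except KeyError:
        raise ValueError("unknown scheme: %r" % (scheme,)) from None
    ordered = sorted(data)
    if not ordered:
        raise ValueError("no median for empty data")
    return pick(ordered)
```

Each scheme still ends up as its own private function behind the dispatcher, but the public namespace gains only one name.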

> R uses a parameter called "type" to choose between alternate calculations,
> not for median as we are discussing, but for quantiles:
>
> quantile(x, probs ... type = 7, ...).
>
> SAS also uses a similar system, but with different numeric codes. I rejected
> both "type" and "method" as the parameter name since it would cause confusion
> with the usual meanings of those words. I eventually decided against this
> system for two reasons:

There are other words to choose from ;) "scheme" seems OK to me.

> - Each scheme ended up needing to be a separate function, for ease of both
> implementation and testing. So I had four private median functions, which I
> put inside a class to act as namespace and avoid polluting the main
> namespace. Then I needed a "master function" to select which of the methods
> should be called, with all the additional testing and documentation that
> entailed.
>
> - The API doesn't really feel very Pythonic to me. For example, we write:
>
> mystring.rjust(width)
> dict.items()

These are methods on objects; the result of these calls depends on the value
of the 'self' argument, not merely its class. Not so with a median singleton.

We also have len(seq) and copy.copy(obj).
No classes required.

> rather than mystring.justify(width, "right") or dict.iterate("items"). So I
> think individual methods is a better API, and one which is more familiar to
> most Python users. The only innovation (if that's what it is) is to have
> median a callable object.
>
> As far as having four separate functions, median, median_low, etc., it just
> doesn't feel right to me. It puts four slight variations of the same function
> into the main namespace, instead of keeping them together in a namespace.
> Names like median_low merely simulate a namespace with pseudo-methods
> separated with underscores instead of dots, only without the advantages of a
> real namespace.
>
> (I treat variance and std dev differently, and make the sample and population
> forms separate top-level functions rather than methods, simply because they
> are so well-known from scientific calculators that it is unthinkable to me to
> do differently. Whenever I use numpy, I am surprised all over again that it
> has only a single variance function.)
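
To make the variance remark concrete, here is a small sketch of sample and population variance as two separate top-level functions. The names pvariance and variance and the shared helper are assumptions for illustration; the quoted text does not name them:

```python
def _sum_sq_dev(data):
    """Hypothetical helper: return (sum of squared deviations, n)."""
    values = list(data)
    n = len(values)
    if n == 0:
        raise ValueError("variance requires at least one data point")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values), n


def pvariance(data):
    """Population variance: divide by n."""
    ss, n = _sum_sq_dev(data)
    return ss / n


def variance(data):
    """Sample variance: divide by n - 1."""
    ss, n = _sum_sq_dev(data)
    if n < 2:
        raise ValueError("sample variance requires at least two data points")
    return ss / (n - 1)
```

The two differ only in the divisor, which is why some libraries expose a single function with a parameter instead.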
Re: [Python-Dev] PEP 450 adding statistics module
On 16/08/13 17:47, Oscar Benjamin wrote:
> I can't think of a situation where 1 or 2 modes are acceptable but 3
> is not. The only forms I can imagine using are mode(data) to get the
> unique mode if it exists and mode(data, max_modes=None) to get the set
> of all modes.

Hmmm, I think you are right. The current design is leftover from when mode
also supported continuous data, and it made more sense there.

> But for that usage it would be better to have a boolean
> flag and then either way you're at the point where it would normally
> become two functions.

Alright, you've convinced me. I'll provide two functions: mode, which returns
the single value with the highest frequency, or raises; and a second function,
which collates the data into a sorted (value, frequency) list. Bike-shedding
on the name of this second function is welcomed :-)

--
Steven
Re: [Python-Dev] PEP 450 adding statistics module
On Aug 16, 2013 11:05 AM, "Steven D'Aprano" wrote:
>
> I'll provide two functions: mode, which returns the single value with the
> highest frequency, or raises; and a second function, which collates the
> data into a sorted (value, frequency) list. Bike-shedding on the name of
> this second function is welcomed :-)

I'd call it counts() and prefer an OrderedDict for easy lookup. By that point
you're very close to Counter though (which it currently uses internally).

Oscar
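
A minimal sketch of the counts() idea, assuming it is built directly on collections.Counter and returns an OrderedDict sorted by descending frequency. The exact ordering and return type are assumptions, since the thread leaves them open:

```python
from collections import Counter, OrderedDict


def counts(data):
    """Map each value to its frequency, most frequent first."""
    # most_common() sorts by count, highest first; equal counts keep
    # the order in which the values were first seen.
    return OrderedDict(Counter(data).most_common())
```

Iterating over counts(data).items() gives the sorted (value, frequency) pairs, while counts(data)[value] gives direct lookup.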
[Python-Dev] Summary of Python tracker Issues
ACTIVITY SUMMARY (2013-08-09 - 2013-08-16)
Python tracker at http://bugs.python.org/

To view or respond to any of the issues listed below, click on the issue.
Do NOT respond to this message.

Issues counts and deltas:
  open    4152 ( +4)
  closed 26377 (+56)
  total  30529 (+60)

Open issues with patches: 1896


Issues opened (51)
==================

#17477: update the bsddb module do build with db 5.x versions
http://bugs.python.org/issue17477  reopened by jcea

#18693: help() not helpful with enum
http://bugs.python.org/issue18693  reopened by ethan.furman

#18697: Unify arguments names in Unicode object C API documentation
http://bugs.python.org/issue18697  opened by serhiy.storchaka

#18699: What is Future.running() for in PEP 3148 / concurrent.futures.
http://bugs.python.org/issue18699  opened by gvanrossum

#18701: Remove outdated PY_VERSION_HEX checks
http://bugs.python.org/issue18701  opened by serhiy.storchaka

#18702: Report skipped tests as skipped
http://bugs.python.org/issue18702  opened by serhiy.storchaka

#18703: To change the doc of html/faq/gui.html
http://bugs.python.org/issue18703  opened by madan.ram

#18704: IDLE: PEP8 Style Check Integration
http://bugs.python.org/issue18704  opened by JayKrish

#18705: Fix typos/spelling mistakes in Lib/*.py files
http://bugs.python.org/issue18705  opened by iwontbecreative

#18706: test failure in test_codeccallbacks
http://bugs.python.org/issue18706  opened by pitrou

#18707: the readme should also talk about how to build doc.
http://bugs.python.org/issue18707  opened by madan.ram

#18709: SSL module fails to handle NULL bytes inside subjectAltNames g
http://bugs.python.org/issue18709  opened by christian.heimes

#18710: Add PyState_GetModuleAttr
http://bugs.python.org/issue18710  opened by pitrou

#18711: Add PyErr_FormatV
http://bugs.python.org/issue18711  opened by pitrou

#18712: Pure Python operator.index doesn't match the C version.
http://bugs.python.org/issue18712  opened by mark.dickinson

#18713: Enable surrogateescape on stdin and stdout when appropriate
http://bugs.python.org/issue18713  opened by ncoghlan

#18714: Add tests for pdb.find_function
http://bugs.python.org/issue18714  opened by kevinjqiu

#18715: Tests fail when run with coverage
http://bugs.python.org/issue18715  opened by seydou

#18716: Deprecate the formatter module
http://bugs.python.org/issue18716  opened by brett.cannon

#18717: test for request.urlretrieve
http://bugs.python.org/issue18717  opened by mjehanzeb

#18718: datetime documentation contradictory on leap second support
http://bugs.python.org/issue18718  opened by wolever

#18720: Switch suitable constants in the socket module to IntEnum
http://bugs.python.org/issue18720  opened by eli.bendersky

#18723: shorten function of textwrap module is susceptible to non-norm
http://bugs.python.org/issue18723  opened by vajrasky

#18725: Multiline shortening
http://bugs.python.org/issue18725  opened by serhiy.storchaka

#18726: json functions have too many positional parameters
http://bugs.python.org/issue18726  opened by serhiy.storchaka

#18727: test for writing dictionary rows to CSV
http://bugs.python.org/issue18727  opened by mjehanzeb

#18728: Increased test coverage for filecmp.py
http://bugs.python.org/issue18728  opened by Alex.Volkov

#18729: In unittest.TestLoader.discover doc select the name of load_te
http://bugs.python.org/issue18729  opened by py.user

#18730: suffix parameter in NamedTemporaryFile silently fails when not
http://bugs.python.org/issue18730  opened by dloewenherz

#18731: Increased test coverage for uu and telnet
http://bugs.python.org/issue18731  opened by Alex.Volkov

#18733: elementtree: stop the parser more quickly on error
http://bugs.python.org/issue18733  opened by haypo

#18734: Berkeley DB versions 4.4-4.9 are not discovered by setup.py
http://bugs.python.org/issue18734  opened by Eddie.Stanley

#18736: Invalid charset in HTML pages inside documentation in CHM form
http://bugs.python.org/issue18736  opened by grv87

#18737: Get virtual subclasses of an ABC
http://bugs.python.org/issue18737  opened by christian.heimes

#18738: String formatting (% and str.format) issues with Enum
http://bugs.python.org/issue18738  opened by ethan.furman

#18739: math.log of a long returns a different value of math.log of an
http://bugs.python.org/issue18739  opened by gregory.p.smith

#18741: Fix typos/spelling mistakes in Lib/*/*/.py files
http://bugs.python.org/issue18741  opened by iwontbecreative

#18742: Abstract base class for hashlib
http://bugs.python.org/issue18742  opened by christian.heimes

#18743: References to non-existant "StringIO" module
http://bugs.python.org/issue18743  opened by jcea

#18744: pathological performance using tarfile
http://bugs.python.org/issue18744  opened by teamnoir

#18745: Test enum in test_json is ignorant of infinity value
http://bugs.python.org/issue18745  opened by vajrasky

#18746: test_threading.test_finalize_with_trace() fails on FreeBSD bui
ht
