On 3/5/20 9:10 AM, Steven D'Aprano wrote:
On Thu, Mar 05, 2020 at 08:23:22AM -0500, Richard Damon wrote:
Yes, that is the idea of AlmostTotalOrder, to have algorithms that
really require a total order (like sorting)
Sorting doesn't require a total order. Sorting only requires a weak
order where the only operator required is the "comes before" operator,
or less than. That's precisely how sorting in Python is implemented.
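As a minimal sketch of that point, the hypothetical `Version` class below (not from this thread) defines only `__lt__`, and `sorted()` still orders its instances, since the built-in sort only ever asks whether one item is less than another:

```python
class Version:
    """Hypothetical example type that defines only the "comes before" operator."""

    def __init__(self, major, minor):
        self.major = major
        self.minor = minor

    def __lt__(self, other):
        # The only comparison provided; no __le__, __gt__, __ge__ or __eq__.
        return (self.major, self.minor) < (other.major, other.minor)

    def __repr__(self):
        return f"Version({self.major}, {self.minor})"


versions = [Version(3, 8), Version(2, 7), Version(3, 6)]
ordered = sorted(versions)  # works: sorting only calls __lt__
```

Asking for `Version(2, 7) <= Version(3, 6)` would raise TypeError, because `<=` was never defined, yet the sort above never needs it.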
Here is an interesting discussion of a practical use-case of sorting
data with a partial order:
https://blog.thecybershadow.net/2018/11/18/d-compilation-is-too-slow-and-i-am-forking-the-compiler/
Reading that, yes, there are applications of sorting that don't need a
total order, but as the article points out, many of the general-purpose
sorting algorithms do (like the one that Python uses in sort).
but we really need to use a
type that has these exceptional values. Imagine that sort/median was
defined to type check its parameter,
No need to imagine it, sort already type-checks its arguments:
py> sorted([1, 3, 5, "Hello", 2])
TypeError: '<' not supported between instances of 'str' and 'int'
If you consider that proper type checking, then you must consider that
the proper answer for the median of a list of numbers that contains a NaN
is any of the numbers in the list. If sort had an easy/cheap way to
confirm that values passed to it met its assumptions, then it could make
a reasonable response.
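A rough sketch of what such a check might look like, assuming one linear pass over the data is cheap enough. The name `checked_sorted` is hypothetical, and the test only catches values that are not equal to themselves (such as float NaN), not every way a type can fail to be totally ordered:

```python
def checked_sorted(values):
    """Sort values, but refuse input containing NaN-like items.

    Hypothetical helper: it relies on the fact that a NaN compares
    unequal to itself, so `x != x` is true only for such values.
    """
    values = list(values)
    for x in values:
        if x != x:
            raise ValueError(f"cannot sort: {x!r} is not equal to itself")
    return sorted(values)
```

With this, `checked_sorted([3.0, 1.0, 2.0])` returns the sorted list, while `checked_sorted([3.0, float("nan"), 1.0])` raises ValueError instead of returning an order that depends on where the NaN happened to sit.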
and that meant that you couldn't
take the median of a list of floats (because float has the NaN value
that breaks TotalOrder).
Dealing with NANs depends on what you want to do with the data. If you
are sorting for presentation purposes, what you probably want is to sort
with a custom key that pushes all the NANs to the front (or rear) of the
list. If you are sorting for the purposes of calculating the median, it
depends. There are at least three reasonable strategies for median:
- ignore the NANs;
- return a NAN;
- raise an exception.
Personally, I think that the first is by far the most practical: if you
have NANs in your statistical data, that's probably because they've
come from some other library or application that is using them to
represent missing values, and if that's the case, the right thing to do
is to ignore them.
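The presentation sort and the three median strategies described above could be sketched as follows. The function names and the `nan_policy` parameter are assumptions made for illustration; `statistics.median` itself has no such option:

```python
import math
import statistics


def sort_nans_last(values):
    """Sort floats ascending, with every NaN pushed to the rear."""
    # The key never compares a NaN: NaNs map to (True, 0.0),
    # ordinary numbers map to (False, x).
    return sorted(values, key=lambda x: (math.isnan(x), 0.0 if math.isnan(x) else x))


def median_with_policy(values, nan_policy="ignore"):
    """Median with an explicit NaN policy: 'ignore', 'propagate' or 'raise'."""
    if nan_policy not in ("ignore", "propagate", "raise"):
        raise ValueError(f"unknown nan_policy: {nan_policy!r}")
    values = list(values)
    if any(math.isnan(x) for x in values):
        if nan_policy == "raise":
            raise ValueError("data contains NaN")
        if nan_policy == "propagate":
            return math.nan
        # 'ignore': drop the NaNs and take the median of what is left.
        values = [x for x in values if not math.isnan(x)]
    return statistics.median(values)
```

To push the NaNs to the front instead, the first element of the key can be negated (`not math.isnan(x)`).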
There was a discussion not that long ago about that very topic. All those options can
be reasonable, but ignoring seems to me to be one of the worse options
for a simple package (but reasonable for one where the whole package
uses that convention). The danger of it is that if you get a NaN as a
result of a computation generating your data, that error gets hidden by
having the data just be ignored. I would say that in Python, it would
make a lot more sense to use None as the missing data code, and leave
NaN for invalid data/computations. That way you keep things explicit.
The use of NaN here goes back to the use of strictly statically typed
languages for doing this, where NaN was a convenient special value to
mark it. (Prior to the invention of NaN, you just used an impossible
value for these.)
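A small sketch of that convention, assuming None marks a missing observation and NaN marks an invalid computation. The function name `median_explicit` is made up for illustration:

```python
import math
import statistics


def median_explicit(values):
    """Median where None means "missing" and NaN means "something went wrong"."""
    # Missing observations are skipped deliberately and visibly.
    present = [x for x in values if x is not None]
    # A NaN is treated as an upstream error, so it is reported, not hidden.
    if any(math.isnan(x) for x in present):
        raise ValueError("NaN in data: an earlier computation produced an invalid value")
    return statistics.median(present)
```

Here `median_explicit([1.0, None, 3.0])` skips the missing entry, while a NaN anywhere in the data raises ValueError rather than being silently dropped.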
--
Richard Damon
_______________________________________________
Python-ideas mailing list -- [email protected]
Message archived at
https://mail.python.org/archives/list/[email protected]/message/S2OY6FFW32JP2ACQFQ4645NGYP4ZZKQT/