On Fri, 18 Jul 2014 01:36:24 +1000, Chris Angelico wrote:

> On Fri, Jul 18, 2014 at 1:12 AM, Johann Hibschman <jhibsch...@gmail.com>
> wrote:
>> Well, I just spotted this thread. An easy example is, well, pretty
>> much any case where SQL NULL would be useful. Say I have lists of
>> borrowers, the amount owed, and the amount they paid so far.
>>
>>     nan = float("nan")
>>     borrowers = ["Alice", "Bob", "Clem", "Dan"]
>>     amount_owed = [100.0, nan, 200.0, 300.0]
>>     amount_paid = [100.0, nan, nan, 200.0]
>>     who_paid_off = [b for (b, ao, ap) in
>>                     zip(borrowers, amount_owed, amount_paid)
>>                     if ao == ap]
>>
>> I want to just get Alice from that list, not Bob. I don't know how
>> much Bob owes or how much he's paid, so I certainly don't know that
>> he's paid off his loan.
>
> But you also don't know that he hasn't. NaN doesn't mean "unknown", it
> means "Not a Number". You need a more sophisticated system that allows
> for uncertainty in your data. I would advise using either None or a
> dedicated singleton (something like `unknown = object()` would work, or
> you could make a custom type with a more useful repr)
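For reference, here is a runnable version of the quoted example next to the None-based alternative. This is only a sketch: the names owed_none, paid_none, naive and explicit are mine, not from the earlier posts. It shows that NAN's "never equal to anything" rule excludes Bob automatically, while a None sentinel needs an explicit test, because None == None is true.

```python
nan = float("nan")
borrowers = ["Alice", "Bob", "Clem", "Dan"]
amount_owed = [100.0, nan, 200.0, 300.0]
amount_paid = [100.0, nan, nan, 200.0]

# NAN compares unequal to everything, including itself, so Bob drops out.
who_paid_off = [b for (b, ao, ap) in
                zip(borrowers, amount_owed, amount_paid)
                if ao == ap]
print(who_paid_off)

# Hypothetical None-based variant of the same data.
owed_none = [100.0, None, 200.0, 300.0]
paid_none = [100.0, None, None, 200.0]

# A naive equality test wrongly includes Bob, since None == None is True.
naive = [b for (b, ao, ap) in zip(borrowers, owed_none, paid_none)
         if ao == ap]
print(naive)

# The unknowns have to be filtered out explicitly.
explicit = [b for (b, ao, ap) in zip(borrowers, owed_none, paid_none)
            if ao is not None and ap is not None and ao == ap]
print(explicit)
```

So the sentinel approach works, but every comparison has to remember to check for it.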
Hmmm, there's something to what you say there, but IEEE-754 NANs seem to
have been designed to do quadruple (at least!) duty with multiple
meanings, including:

- Missing values ("I took a reading, but I can't read my handwriting").

- Data known only qualitatively, not quantitatively (e.g. windspeed =
  "fearsome").

- Inapplicable values, e.g. the average depth of the oceans on Mars.

- The result of calculations which are mathematically indeterminate,
  such as 0/0.

- The result of real-valued calculations which are invalid due to
  domain errors, such as sqrt(-1) or acos(2.5).

- The result of calculations which are conceptually valid, but are
  unknown due to limitations of floats, e.g. you have two finite
  quantities which have both overflowed to INF, the difference between
  them ought to be finite, but there's no way to tell what it should be.

It seems to me that the way you treat a NAN will often depend on which
category it falls under. E.g. when taking the average of a set of values,
missing values ought to be skipped over, while actual indeterminate NANs
ought to carry through:

    average([1, 1, 1, Missing, 1]) => 1
    average([1, 1, 1, 0/0, 1]) => NAN

I know that R distinguishes between NA and IEEE-754 NANs, although I'm
not sure how complete its support for NANs is. But many (most?) R
functions take an argument controlling whether or not to ignore NA
values.

In principle, you can encode the different meanings into NANs using the
payload. There are 9007199254740990 (that is, 2**53 - 2) possible Python
float NANs. About half of these are signalling NANs, the rest are quiet
NANs. Ignoring the sign bit leaves us with 2251799813685247 distinct
sNANs and 2251799813685248 distinct qNANs (an sNAN needs a non-zero
payload, otherwise its bit pattern would be that of an INF). That's
enough to encode a *lot* of different meanings.

[Aside: I find myself perplexed why IEEE-754 says that the sign bit of
NANs should be ignored, but then specifies that another bit is to be used
to distinguish signalling from quiet NANs. Why not just interpret NANs
with the sign bit set as signalling, and those with it clear as quiet?]
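To make the two ideas above concrete, here is a rough sketch, not a recommendation. MISSING, average(), make_nan() and nan_payload() are names I made up for illustration. It assumes floats are IEEE-754 binary64 (true on every platform CPython commonly runs on) and that struct round-trips the NAN bit pattern unchanged, which I believe holds for quiet NANs on mainstream hardware. Python raises ZeroDivisionError for 0/0, so float("nan") stands in for the indeterminate result.

```python
import math
import struct

MISSING = object()  # hypothetical sentinel for "no reading available"

def average(values):
    """Mean that skips MISSING entries but lets NANs propagate."""
    present = [v for v in values if v is not MISSING]
    if not present:
        return float("nan")  # nothing to average
    return sum(present) / len(present)

_EXPONENT = 0x7FF0000000000000      # all eleven exponent bits set
_QUIET_BIT = 0x0008000000000000     # top mantissa bit marks a quiet NAN
_PAYLOAD_MASK = (1 << 51) - 1       # remaining 51 mantissa bits

def make_nan(payload):
    """Build a quiet NAN carrying `payload` in its low 51 bits."""
    if not 0 <= payload <= _PAYLOAD_MASK:
        raise ValueError("payload must fit in 51 bits")
    bits = _EXPONENT | _QUIET_BIT | payload
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def nan_payload(x):
    """Return the low 51 mantissa bits of a NAN."""
    if not math.isnan(x):
        raise ValueError("not a NAN")
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits & _PAYLOAD_MASK

print(average([1, 1, 1, MISSING, 1]))        # missing value skipped
print(average([1, 1, 1, float("nan"), 1]))   # NAN carries through
print(nan_payload(make_nan(42)))             # payload read back
```

Whether a payload survives arithmetic is another matter; as far as I know that is platform-dependent, so I wouldn't rely on it beyond storing and reading back a value.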
--
Steven