Terry Reedy writes:
>On 10/8/2010 9:45 AM, Hallvard B Furuseth wrote:
>>> Actually, the implicit contract of __str__ is that it never fails, so
>>> that everything can be printed out (for debugging purposes, etc.).
>>
>> Nope:
>>
>> $ python2 -c 'str(u"\u1000")'
>> Traceback (most recent call last):
>>   File "<string>", line 1, in ?
>> UnicodeEncodeError: 'ascii' codec can't encode character u'\u1000' in
>> position 0: ordinal not in range(128)
>
> This could be considered a design bug due to 'str' being used both to
> produce readable string representations of objects (perhaps one that
> could be eval'ed) and to convert unicode objects to equivalent string
> objects.

which is not the same operation!
Indeed, the eager str() and the lack of a more narrow str function is
one root of the problem. I'd put it more generally: Converting an
object which represents a string, to an actual str. *And* __str__ may
be intended for Python-independent representations like 23 -> "23".

I expect that's why quite a bit of code calls str() just in case, which
is another root of the problem. E.g. urlencode(), as I said. The code
might not need to, but str('string') is a noop so it doesn't hurt.
Maybe that's why %s does too, instead of demanding that the user calls
str() if needed.

> The above really should have produced '\u1000'! (the equivalent of what
> str(bytes) does today). The 'conversion to equivalent str object' option
> should have required an explicit encoding arg rather than defaulting to
> the ascii codec. This mistake has been corrected in 3.x, so

Yep. If there were a __plain_str__() method which was supposed to fail
rather than start to babble Python syntax, and if there were not plenty
of Python code around which invoked __str__, I'd agree. As it is, this
"correction" instead is causing code which previously produced the
expected non-Python-related string output, to instead produce
Pythonesque repr() stuff. See below.

>> And the equivalent:
>>
>> $ python2 -c 'unicode("\xA0")'
>> Traceback (most recent call last):
>>   File "<string>", line 1, in ?
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0:
>> ordinal not in range(128)
>
> This is an application bug: either bad string or missing decoding arg.

Exactly. And Python 2 caught the bug. (Since I had ASCII default
decoding, I'd forgotten Python could pick another default.)

For an app which handles Unicode vs. raw bytes, the equivalent Python 3
code is str(b"\xA0"). That's the *same* application bug, in equivalent
application code, and Python 3 does not catch it.
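A minimal sketch of that asymmetry, for anyone who wants to try it. It assumes a Python 3 interpreter started with default options (in particular without -b or -bb, which make CPython warn or fail on str() of a bytes object). The variable names are made up for illustration.

```python
# Sketch only: assumes Python 3 started without the -b / -bb options.
raw = b"\xa0"  # one byte that is not valid ASCII

# str(bytes) without an encoding does not decode anything;
# it quietly returns the repr() of the bytes object.
as_text = str(raw)

# bytes(str) without an encoding is rejected outright.
try:
    bytes("\xa0")
    bytes_error = None
except TypeError as exc:
    bytes_error = exc

# Only an explicit encoding brings back the Python 2 style failure.
try:
    str(raw, "ascii")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

print(repr(as_text))
print(type(bytes_error).__name__)
print(type(decode_error).__name__)
```

So only the first of the three conversions goes through without an exception, and its result is Python syntax rather than the original data.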
This time the bug is spelled str() instead, which is much more likely
than old unicode() to happen somewhere thanks to the str()-related
misdesign discussed above. Article <hbf.20101008c...@bombur.uio.no> in
this thread has an example. And that's the third root of the problem
above.

Technically it's the same problem that an application bug can do
str(None) where it should be using a string, and produce garbage text.
The difference is that Python forces programs to deal with these two
different character/octet string types, sometimes swapping back and
forth between them. And it's not necessarily obvious from the code
which type is in use where. Python 3 has not changed that, it has
strengthened it by removing the default conversion. Yet while the
programmer now needs to be _more_ careful about this than before,
Python 3 has removed the exception which caught this particular bug
instead of doing something to make it easier to find such bugs.

That's why I suggested making bytes.__str__ fail by default, annoying
as it would be. But I don't know how annoying it'd be. Maybe there
could be an option to disable it.

>> In Python 2, these two UnicodeEncodeErrors made our data safe from code
>> which used str and unicode objects without checking too carefully which
>> was which. Code which sort the types out carefully enough would fail.
>>
>> In Python 3, that safety only exists for bytes(str), not str(bytes).
>
> If you prefer the buggy 2.x design (and there are *many* tracker bug
> reports that were fixed by the 3.x change), stick with it.

Bugs even with ASCII default encoding? Looking closer at setencoding()
in site.py, it doesn't seem to do anything, it's "if 0"ed out.

As I think I've made clear, I certainly don't feel like entrusting
Python 3 with my raw string data just yet.

-- 
Hallvard