On Sat, 13 Dec 2008 14:09:04 -0800, John Machin wrote:

> On Dec 14, 8:07 am, "Chris Rebert" <c...@rebertia.com> wrote:
>> On Sat, Dec 13, 2008 at 12:28 PM, John Machin <sjmac...@lexicon.net>
>> wrote:
>>
>> > Python 2.6.1 (r261:67517, Dec 4 2008, 16:51:00) [MSC v.1500 32 bit
>> > (Intel)] on win32
>> > Type "help", "copyright", "credits" or "license" for more
>> > information.
>> > >>> x = u'\u9876'
>> > >>> x
>> > u'\u9876'
>>
>> > # As expected
>>
>> > Python 3.0 (r30:67507, Dec 3 2008, 20:14:27) [MSC v.1500 32 bit
>> > (Intel)] on win 32
>> > Type "help", "copyright", "credits" or "license" for more
>> > information.
>> > >>> x = '\u9876'
>> > >>> x
>> > Traceback (most recent call last):
>> >   File "<stdin>", line 1, in <module>
>> >   File "C:\python30\lib\io.py", line 1491, in write
>> >     b = encoder.encode(s)
>> >   File "C:\python30\lib\encodings\cp850.py", line 19, in encode
>> >     return codecs.charmap_encode(input,self.errors,encoding_map)[0]
>> > UnicodeEncodeError: 'charmap' codec can't encode character '\u9876'
>> > in position
>> > 1: character maps to <undefined>
>>
>> > # *NOT* as expected (by me, that is)
>>
>> > Is this the intended outcome?
>>
>> When Python tries to display the character, it must first encode it
>> because IO is done in bytes, not Unicode codepoints. When it tries to
>> encode it in CP850 (apparently your system's default encoding judging
>> by the traceback), it unsurprisingly fails (CP850 is an old Western
>> Europe codec, which obviously can't encode an Asian character like the
>> one in question). To signal that failure, it raises an exception, thus
>> the error you see.
>> This is intended behavior.
>
> I see. That means that the behaviour in Python 1.6 to 2.6 (i.e. encoding
> the text using the repr() function (as then defined) was not intended
> behaviour?
>
>> Either change your default system/terminal encoding to one that can
>> handle such characters or explicitly encode the string and use one of
>> the provided options for dealing with unencodable characters.
>
> You are missing the point. I don't care about the visual representation.
> What I care about is an unambiguous representation that can be used when
> communicating about problems across cultures/
> networks/mail-clients/news-readers ... the sort of problems that are
> initially advised as "I got this UnicodeEncodeError" and accompanied by
> no data or garbled data.

Python defaults to strict encoding, which means raising an error on
unencodable characters, but this is NOT the only behavior. You can change
it to "replace with a placeholder character" or "ignore any errors and
discard unencodable characters":

| errors can be 'strict', 'replace' or 'ignore' and defaults
| to 'strict'.

If you don't like the default behavior, or you want another kind of
behavior, you're welcome to file a bug report at http://bugs.python.org

>> Also, please don't call it a "crash" as that's very misleading. The
>> Python interpreter didn't dump core, an exception was merely thrown.
>
> "spew nonsense on the screen and then stop" is about as useful and as
> astonishing as "dump core".

That's an interesting definition of crash. It is like saying: "C has
crashed because I made a bug in my program". In this context it is your
program that crashes, not Python or C, so calling it a Python crash is
misleading.

It would be Python's crash if:
1. Python segfaulted
2. The Python interpreter exits before there is an instruction to exit
   (either implicit (e.g. falling off the last line of the script) or
   explicit (e.g. sys.exit or raise SystemExit))
3. Python dumped core
4. Python does something that is not documented

--
http://mail.python.org/mailman/listinfo/python-list
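
To illustrate the errors argument quoted above ('strict', 'replace' or 'ignore'), here is a minimal sketch, assuming Python 3 and the cp850 codec named in the traceback. The variable names are made up for illustration, and 'backslashreplace' is a further standard handler that is not mentioned in this thread; it is included only because it yields an unambiguous ASCII form of the character.

```python
# Sketch only: the sample character and the cp850 codec come from the
# traceback quoted in this thread; the variable names are illustrative.
text = '\u9876'

try:
    text.encode('cp850')  # errors='strict' is the default
    strict_failed = False
except UnicodeEncodeError as exc:
    strict_failed = True
    print('strict raised:', exc.reason)

# 'replace' substitutes a placeholder byte for the unencodable character.
replaced = text.encode('cp850', errors='replace')

# 'ignore' silently drops the unencodable character.
ignored = text.encode('cp850', errors='ignore')

# 'backslashreplace' (not discussed in the thread) writes an escape instead.
escaped = text.encode('cp850', errors='backslashreplace')

print(replaced, ignored, escaped)
```

Which handler is appropriate depends on whether losing the character is acceptable: 'replace' and 'ignore' discard information, while the escaped form can be pasted into a mail or bug report without depending on the terminal's encoding.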