On Thu, 02 Dec 2010 12:17:53 +0100, Peter Otten wrote:

>> This was actually a critical flaw in Python 3.0, as it meant that
>> filenames which weren't valid in the locale's encoding simply couldn't be
>> passed via argv or environ. 3.1 fixed this using the "surrogateescape"
>> encoding, so now it's only an annoyance (i.e. you can recover the original
>> bytes once you've spent enough time digging through the documentation).
>
> Is it just that you need to harden your scripts against these byte sequences
> or do you actually encounter them? If the latter, can you give some
> examples?

Assume that you have a Python 3 script which takes filenames on the
command line. If any of the filenames contain byte sequences which aren't
valid in the locale's encoding, each offending byte will be decoded to a
lone surrogate character in the range U+DC80 to U+DCFF. To recover the
original bytes, you need to use 'surrogateescape' as the error handling
method when encoding, e.g.:

    import sys

    enc = sys.getfilesystemencoding()
    argv_bytes = [arg.encode(enc, 'surrogateescape') for arg in sys.argv]

Otherwise, encode() will raise UnicodeEncodeError, complaining that it
cannot encode the surrogate characters.

Similarly for os.environ.

For anything else, you can just use sys.setfilesystemencoding('iso-8859-1')
at the beginning of the script. Decoding as ISO-8859-1 will never fail, and
encoding as ISO-8859-1 will give you the original bytes.

But argv and environ are decoded before your script can change the
encoding, so you need to know the "trick" to undo them if you want to
write a robust Python 3 script which works with byte strings in an
encoding-agnostic manner (i.e. a traditional Unix script).
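
To make the round trip concrete, here is a minimal sketch of both tricks
described above. The helper names (to_bytes, args_as_bytes) and the sample
filename are made up for illustration, and the encoding is passed in
explicitly so that the result does not depend on the locale of the machine
running it. It deliberately avoids sys.setfilesystemencoding(), which was
removed in Python 3.2.

```python
import sys


def to_bytes(text, encoding=None):
    """Undo a surrogateescape decode and return the original bytes.

    `encoding` defaults to the filesystem encoding, which is what Python
    uses for sys.argv and os.environ on Unix.
    """
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    return text.encode(encoding, "surrogateescape")


def args_as_bytes(args, encoding=None):
    """Apply to_bytes() to every argument (hypothetical helper)."""
    return [to_bytes(arg, encoding) for arg in args]


# A Latin-1 encoded filename; the byte 0xE9 is not valid UTF-8 here.
raw = b"caf\xe9.txt"

# Simulate what the interpreter does to argv under a UTF-8 locale:
# the undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
decoded = raw.decode("utf-8", "surrogateescape")

# Encoding with the same error handler restores the original bytes.
recovered = to_bytes(decoded, "utf-8")

# The ISO-8859-1 trick: every byte value maps to exactly one character,
# so decoding never fails and encoding gives the same bytes back.
all_bytes = bytes(range(256))
latin1_round_trip = all_bytes.decode("iso-8859-1").encode("iso-8859-1")
```

In a real script you would call args_as_bytes(sys.argv) and leave the
encoding argument at its default. On Python 3.2 and later, os.fsencode()
does the same conversion for a single string.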