On Wed, 23 Feb 2011 04:14:29 -0800, Chris Rebert wrote:

>> Ok, but that the interface handles UTF-8 strings
>> are still ok? The defaultencoding is still ascii.
>
> Yes, that's fine. UTF-8 is an excellent encoding choice, and
> encoding/decoding should always be done explicitly in Python, so the
> "default encoding" ideally ought to never come into play (and indeed,
> Python 3 does away with bug-prone implicit encoding/decoding entirely
> FWICT).
On Unix, you have to go out of your way to avoid the use of implicit
encoding/decoding with the "filesystem" encoding. This is because Unix
extensively uses byte strings with no associated encoding, but Python 3
tries to use Unicode for everything.

Python 3.0 was essentially unusable as a Unix scripting language for
this reason, as argv and environ were converted to Unicode, with no
possibility of recovering from unconvertible sequences.

Python 3.1 added the surrogate-escape mechanism, which allows recovery
of the original byte sequences, albeit with some effort (i.e. you had to
explicitly re-encode the strings in os.environ and sys.argv using the
"surrogateescape" error handler).

Python 3.2 adds os.environb (a bytes version of os.environ), but it
appears that sys.argv still has to be encoded manually. It also provides
os.fsencode() and os.fsdecode() to simplify the conversion.

Most functions accept bytes arguments, and most either return bytes when
passed bytes or, if they accept no arguments, have a bytes equivalent
(e.g. os.getcwdb() alongside os.getcwd()). But variables tend to be
Unicode strings with no bytes version (os.environb is the exception
rather than the rule), and some functions have no bytes equivalent (e.g.
os.ctermid(), os.uname(), os.ttyname(); fortunately it's rather unlikely
that the result from any of these functions will contain non-ASCII
characters).
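
To make the round trip concrete, here is a minimal sketch, assuming a
Unix system and Python 3.2 or later. The sample byte string and the
variable names are made up for illustration; only the codec error
handler and the os functions come from the discussion above.

```python
import os
import sys

# Hypothetical example: a byte string that is not valid UTF-8, such as
# a Latin-1 filename might be on a Unix filesystem.
raw = b"caf\xe9.txt"

# "surrogateescape" maps each undecodable byte to a lone surrogate
# (0xE9 becomes U+DCE9) instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same codec and error handler restores the bytes.
restored = text.encode("utf-8", "surrogateescape")

# os.fsdecode()/os.fsencode() do the same thing with the filesystem
# encoding, so the codec does not have to be spelled out.
fs_text = os.fsdecode(raw)
fs_restored = os.fsencode(fs_text)

# sys.argv has no bytes twin, so it is encoded by hand.
argv_bytes = [os.fsencode(arg) for arg in sys.argv]

# os.environb only exists where the native environment is bytes.
if os.supports_bytes_environ:
    path_bytes = os.environb.get(b"PATH")
```

Note that a string like `text` cannot be passed to print() on a UTF-8
terminal, because the lone surrogate cannot be encoded without the
"surrogateescape" handler; that is the "some effort" part.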