On Wed, 2010-02-10 at 12:17 -0800, Anthony Tolle wrote:
> On Feb 10, 2:09 pm, kj <no.em...@please.post> wrote:
> > Some people have mathphobia.  I'm developing a wicked case of
> > Unicodephobia.
> > [snip]
>
> Some general advice (Looks like I am reiterating what MRAB said -- I
> type slower :):
>
> 1. If possible, use unicode strings for everything.  That is, don't
> use both str and unicode within the same project.
>
> 2. If that isn't possible, convert strings to unicode as early as
> possible, work with them that way, then convert them back as late as
> possible.
>
> 3. Know what type of string you are working with!  If a function
> returns or accepts a string value, verify whether the expected type is
> unicode or str.
>
> 4. Consider switching to Python 3.x, since there is only one string
> type (unicode).
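To make point 2 above concrete, here is a minimal sketch of "decode early, encode late". It is not from the thread: the function names (read_name, shout, write_name) are made up for illustration, and UTF-8 is assumed as the external encoding.

```python
# Hypothetical helpers illustrating "decode early, encode late".
# Assumption: the outside world speaks UTF-8.

def read_name(raw_bytes, encoding="utf-8"):
    """Boundary in: decode incoming bytes to unicode straight away."""
    return raw_bytes.decode(encoding)

def shout(name):
    """Core logic: operates on unicode text only, never on encoded bytes."""
    return name.upper() + u"!"

def write_name(text, encoding="utf-8"):
    """Boundary out: encode only when the data leaves the program."""
    return text.encode(encoding)

incoming = b"\xce\xb1\xce\xb2\xce\xb3"  # UTF-8 bytes for alpha, beta, gamma
text = read_name(incoming)              # unicode from here on
outgoing = write_name(shout(text))      # bytes again, at the last moment
```

The point of the layering is that only read_name and write_name need to know about encodings; everything in between can assume unicode.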

Some further nasty gotchas:

5. Be wary of the encoding of sys.stdout (and stderr/stdin), e.g. when
issuing a "print" statement: they can change on Unix depending on
whether the python process is directly connected to a tty or not.

(a) If they're directly connected to a tty, their encoding is taken from
the locale, UTF-8 on my machine:

[da...@brick ~]$ python -c 'print u"\u03b1\u03b2\u03b3"'
αβγ

(prints alpha, beta, gamma to terminal, though these characters might
not survive being sent in this email)

(b) If they're not (e.g. cronjob, daemon, within a shell pipeline, etc)
their encoding is the default encoding, which is typically ascii;
rerunning the same command, but piping into "cat":

[da...@brick ~]$ python -c 'print u"\u03b1\u03b2\u03b3"'| cat
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position
0-2: ordinal not in range(128)

(c) These problems can lurk in sources and only manifest themselves
during _deployment_ of code.  You can set PYTHONIOENCODING=ascii in the
environment to force (a) to behave like (b), so that your code will fail
whilst you're _developing_ it, rather than on your servers at midnight:

[da...@brick ~]$ PYTHONIOENCODING=ascii python -c 'print u"\u03b1\u03b2\u03b3"'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position
0-2: ordinal not in range(128)

(Given the above, it could be argued perhaps that one should never
"print" unicode instances, and instead should write the data to
file-like objects, specifying an encoding.  Not sure).

6. If you're using pygtk (specifically the "pango" module, typically
implicitly imported), be warned that it abuses the C API to set the
default encoding inside python, which probably breaks any unicode
instances in memory at the time, and is likely to cause weird side
effects:

[da...@brick ~]$ python
Python 2.6.2 (r262:71600, Jan 25 2010, 13:22:47)
[GCC 4.4.2 20100121 (Red Hat 4.4.2-28)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> import pango
>>> sys.getdefaultencoding()
'utf-8'

(the above is on Fedora 12, though I'd expect to see the same weirdness
on any linux distro running gnome 2)

Python 3 will probably make this all much easier; you'll still have to
care about encodings when dealing with files/sockets/etc, but it should
be much more clear what's going on.  I hope.

Hope this is helpful
Dave

-- 
http://mail.python.org/mailman/listinfo/python-list
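P.S. A rough sketch of the idea floated at the end of (5): rather than relying on whatever encoding "print" picks up, wrap the byte stream in a writer with an explicit encoding. The helper name make_writer is made up for illustration, UTF-8 is an assumed choice, and the BytesIO buffer merely stands in for a pipe, file, or socket.

```python
import codecs
import io

def make_writer(binary_stream, encoding="utf-8"):
    """Wrap a byte stream so that unicode written to it is encoded
    explicitly, independent of the tty or the default encoding."""
    return codecs.getwriter(encoding)(binary_stream)

buf = io.BytesIO()                   # stand-in for a pipe, file, or socket
out = make_writer(buf)
out.write(u"\u03b1\u03b2\u03b3\n")   # alpha, beta, gamma
encoded = buf.getvalue()             # the bytes that actually went out

# For comparison, this is what an implicit ascii encoding does to the
# same text, as in case (b):
try:
    u"\u03b1\u03b2\u03b3".encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

For real output the same wrapper should work around sys.stdout on Python 2, or around sys.stdout.buffer on Python 3, though I have not tried it in every environment described above.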