kj wrote:
Some people have mathphobia. I'm developing a wicked case of Unicodephobia.

I have read a *ton* of stuff on Unicode. It doesn't even seem all that hard. Or so I think. Then I start writing code, and WHAM:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

(There, see? My Unicodephobia just went up a notch.)

Here's the thing: I don't even know how to *begin* debugging errors like this. This is where I could use some help.

In the past I've gone for the method of choice of the clueless: "programming by trial-and-error", try random crap until something "works." And if that "strategy" fails, I come begging for help to c.l.p. And thanks for the very effective pointers for getting rid of the errors.

But afterwards I remain as clueless as ever... It's the old "give a man a fish" vs. "teach a man to fish" story.

I need a systematic approach to troubleshooting and debugging these Unicode errors. I don't know what. Some tools maybe. Some useful modules or builtin commands. A diagnostic flowchart? I don't think that any more RTFM on Unicode is going to help (I've done it in spades), but if there's a particularly good write-up on Unicode debugging, please let me know.

Any suggestions would be much appreciated.

FWIW, I'm using Python 2.6. The example above happens to come from a script that extracts data from HTML files, which are all in English, but these errors are a daily occurrence when I write code to process non-English text. The script uses Beautiful Soup. I won't post a lot of code because, as I said, what I'm after is not so much a way around this specific error as much as the tools and techniques to troubleshoot it and fix it on my own.
But to ground the problem a bit I'll say that the exception above happens during the execution of a statement of the form:

    x = '%s %s' % (y, z)

Also, I found that, with the exact same values y and z as above, all of the following statements work perfectly fine:

    x = '%s' % y
    x = '%s' % z
    print y
    print z
    print y, z

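A possible first step when debugging a statement like the one above is to look at the type and repr of every operand before formatting. In Python 2, mixing a unicode object and a byte string in one '%' expression makes the interpreter decode the byte string with the 'ascii' codec, which would explain why each value works alone but the pair fails. The sketch below is only illustrative: the helper name describe and the sample values y and z are hypothetical stand-ins, not the values from the original script, and UTF-8 is an assumed encoding for the bytes. It is written to run on both Python 2.6 and Python 3.

```python
def describe(name, value):
    """Return a one-line summary of a value's type and repr.

    The repr shows raw bytes as escapes, so nothing is implicitly decoded.
    """
    return '%s: %s %r' % (name, type(value).__name__, value)


# Hypothetical stand-ins for the two values in the failing statement.
y = u'caf\xe9'                   # Unicode text
z = u'\xa0x'.encode('utf-8')     # a byte string that starts with 0xc2

# Python 2 reports 'unicode' and 'str' here; Python 3 reports 'str' and 'bytes'.
# Two different types in one '%' expression is the warning sign.
print(describe('y', y))
print(describe('z', z))

# Decoding the byte string explicitly (UTF-8 is an assumption) means both
# operands are Unicode, so no implicit 'ascii' decode takes place.
x = u'%s %s' % (y, z.decode('utf-8'))
print(describe('x', x))
```

If the summary shows a byte string where text was expected, the place to fix it is where that value entered the program, not the formatting statement that happened to raise.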
Decode all text input; encode all text output; do all text processing in Unicode, which also means making all text literals Unicode (prefixed with 'u'). Note: I'm talking about when you're working with _text_, as distinct from when you're working with _binary data_, i.e. bytes.
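The rule above can be sketched as a small decode/process/encode pipeline. This is a minimal illustration, not code from the original script: the helper names to_text and to_bytes and the sample data are hypothetical, and UTF-8 is an assumed encoding (a real script would use whatever encoding its input actually has). It runs on both Python 2.6 and Python 3.

```python
def to_text(raw, encoding='utf-8'):
    """Decode bytes arriving from outside the program into Unicode text."""
    return raw.decode(encoding)


def to_bytes(text, encoding='utf-8'):
    """Encode Unicode text into bytes just before it leaves the program."""
    return text.encode(encoding)


# Stand-in for bytes read from a file, socket, or HTML page.
raw_in = u'caf\xe9 \xa0ol\xe9'.encode('utf-8')

# 1. Decode at the input boundary.
text = to_text(raw_in)

# 2. Process in Unicode only; note the u'' prefix on every text literal.
result = u'%s %s' % (text.upper(), u'done')

# 3. Encode at the output boundary.
raw_out = to_bytes(result)
```

With this layout, byte strings exist only at the two edges of the program, so any encoding error points at a boundary rather than at some formatting statement in the middle.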