kj wrote:

Some people have mathphobia.  I'm developing a wicked case of
Unicodephobia.

I have read a *ton* of stuff on Unicode.  It doesn't even seem all
that hard.  Or so I think.  Then I start writing code, and WHAM:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

(There, see?  My Unicodephobia just went up a notch.)

Here's the thing: I don't even know how to *begin* debugging errors
like this.  This is where I could use some help.

In the past I've gone for the method of choice of the clueless:
"programming by trial-and-error", i.e. trying random crap until something
"works."  And if that "strategy" fails, I come begging for help to
c.l.p.  And thanks for the very effective pointers for getting rid
of the errors.

But afterwards I remain as clueless as ever...  It's the old "give
a man a fish" vs. "teach a man to fish" story.

I need a systematic approach to troubleshooting and debugging these
Unicode errors.  I don't know what.  Some tools maybe.  Some useful
modules or builtin commands.  A diagnostic flowchart?  I don't
think that any more RTFM on Unicode is going to help (I've done it
in spades), but if there's a particularly good write-up on Unicode
debugging, please let me know.

Any suggestions would be much appreciated.

FWIW, I'm using Python 2.6.  The example above happens to come from
a script that extracts data from HTML files, which are all in
English, but errors like it are a daily occurrence when I write code
to process non-English text.  The script uses Beautiful Soup.  I won't
post a lot of code because, as I said, what I'm after is not so
much a way around this specific error as the tools and techniques
to troubleshoot it and fix it on my own.  But to ground the problem
a bit I'll say that the exception above happens during the execution
of a statement of the form:

  x = '%s %s' % (y, z)

Also, I found that, with the exact same values y and z as above,
all of the following statements work perfectly fine:

  x = '%s' % y
  x = '%s' % z
  print y
  print z
  print y, z

Decode all text input; encode all text output; do all text processing
in Unicode, which also means making all text literals Unicode (prefixed
with 'u').
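
A minimal sketch of that decode/process/encode pattern follows.  The
function name shout() and the choice of UTF-8 are assumptions made for
illustration; use whatever encoding your input really has.  It should
run unchanged on Python 2.6+ and on Python 3.3+.

```python
def shout(raw_bytes, encoding='utf-8'):
    """Hypothetical example: bytes in, bytes out, Unicode in between."""
    # 1. Decode at the input boundary: bytes -> Unicode text.
    text = raw_bytes.decode(encoding)
    # 2. Do all processing on Unicode, using a Unicode literal (u'...').
    result = u'%s!' % text.upper()
    # 3. Encode at the output boundary: Unicode text -> bytes.
    return result.encode(encoding)

# b'caf\xc3\xa9' is the UTF-8 encoding of "cafe" with an acute accent.
out = shout(b'caf\xc3\xa9')
```

Because the only decode and encode calls sit at the edges, nothing in
the middle ever triggers an implicit ASCII conversion.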

Note: I'm talking about when you're working with _text_, as distinct
from when you're working with _binary data_, i.e. bytes.
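
When an error like the one quoted turns up, a first diagnostic step is
to check which of the values involved are text and which are bytes.
In Python 2, formatting a unicode object together with a byte string
makes Python decode the byte string implicitly as ASCII, and that fails
on any byte above 0x7f.  The sketch below uses a hypothetical helper,
describe(), and made-up values for y and z (the real ones weren't
posted); it only assumes z is UTF-8.

```python
text_type = type(u'')    # unicode on Python 2, str on Python 3
bytes_type = type(b'')   # str on Python 2, bytes on Python 3

def describe(value):
    """Hypothetical helper: label a value as text or bytes, with its repr."""
    if isinstance(value, text_type):
        return 'text: %r' % (value,)
    if isinstance(value, bytes_type):
        return 'bytes: %r' % (value,)
    return 'other: %r' % (value,)

# Made-up stand-ins for the y and z in the question.
y = u'caf\xe9'       # already Unicode text
z = b'\xc2\xa0'      # raw bytes: a UTF-8 encoded no-break space

print(describe(y))
print(describe(z))

# Decode the byte string explicitly before mixing it with text.
fixed = u'%s %s' % (y, z.decode('utf-8'))
```

Anything reported as bytes needs an explicit decode() with the right
encoding before it is combined with text.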
--
http://mail.python.org/mailman/listinfo/python-list
