In <402ac982-0750-4977-adb2-602b19149...@m24g2000prn.googlegroups.com> Jonathan Gardner <jgard...@jonathangardner.net> writes:
>On Feb 10, 11:09 am, kj <no.em...@please.post> wrote:
>> FWIW, I'm using Python 2.6.  The example above happens to come from
>> a script that extracts data from HTML files, which are all in
>> English, but they are a daily occurrence when I write code to
>> process non-English text.  The script uses Beautiful Soup.  I won't
>> post a lot of code because, as I said, what I'm after is not so
>> much a way around this specific error as much as the tools and
>> techniques to troubleshoot it and fix it on my own.  But to ground
>> the problem a bit I'll say that the exception above happens during
>> the execution of a statement of the form:
>>
>>   x = '%s %s' % (y, z)
>>
>> Also, I found that, with the exact same values y and z as above,
>> all of the following statements work perfectly fine:
>>
>>   x = '%s' % y
>>   x = '%s' % z
>>   print y
>>   print z
>>   print y, z
>>

>What are y and z?

  x = "%s %s" % (table['id'], table.tr.renderContents())

where the variable table represents a BeautifulSoup.Tag instance.

>Are they unicode or strings?

The first item (table['id']) is unicode, and the second is str.

>What are their values?

The only easy way I know to examine the values of these strings is
to print them, which, I know, is very crude.  (IOW, to answer this
question usefully, in the context of this problem, more Unicode
knowhow is needed than I have.)  If I print them, the output for
the first one on my screen is "mainTable", and for the second it is

<th class="mainTableHeader" colspan="2"> Tags</th>
<th class="mainTableHeader"> Id</th>

>It sounds like someone, probably beautiful soup, is trying to turn
>your strings into unicode. A full stacktrace would be useful to see
>who did what where.
Unfortunately, there's not much in the stacktrace:

Traceback (most recent call last):
  File "./download_tt.py", line 427, in <module>
    x = "%s %s" % (table['id'], table.tr.renderContents())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 41: ordinal not in range(128)

(NB: the difference between this error message and the one I
originally posted, namely the position of the unrecognized byte,
is because I simplified the code for the purpose of posting it
here, eliminating one additional processing of the second entry of
the tuple above.)

~K
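
For reference, a minimal sketch of how the two values could be examined less crudely than with print, and of one possible way around the error. The literals below are hypothetical stand-ins for table['id'] and table.tr.renderContents(); the helper names (describe, first_non_ascii, join_as_text) are made up for illustration, and the sketch assumes the str holds UTF-8 bytes (0xc2 0xa0 would then be a non-breaking space). In Python 2, formatting a unicode and a str together promotes the result to unicode and decodes the str with the ascii codec, which is why each value works alone but the pair fails.

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-ins for the real values from the script.
y = u'mainTable'                                      # unicode, like table['id']
z = b'<th class="mainTableHeader">\xc2\xa0Tags</th>'  # byte string; UTF-8 is an assumption


def describe(value):
    """Return the type name and the repr of a value (repr shows raw escapes)."""
    return '%s %r' % (type(value).__name__, value)


def first_non_ascii(data):
    """Return (offset, byte value) of the first non-ASCII byte, or None."""
    for offset, byte in enumerate(bytearray(data)):
        if byte > 127:
            return offset, byte
    return None


def join_as_text(text, data, encoding='utf-8'):
    """Decode the byte string explicitly, then format both values as text."""
    return u'%s %s' % (text, data.decode(encoding))


print(describe(y))
print(describe(z))
print(first_non_ascii(z))

try:
    # This is the decode that Python 2 attempts implicitly in '%s %s' % (y, z).
    z.decode('ascii')
except UnicodeDecodeError as exc:
    print(exc)

x = join_as_text(y, z)
print(repr(x))
```

The offset reported by first_non_ascii should match the position in the UnicodeDecodeError message, which helps locate the offending byte in the real data. Whether utf-8 is the right codec for the actual script is something to confirm against the Beautiful Soup output encoding.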