On Wed, Feb 10, 2010 at 1:03 PM, kj <no.em...@please.post> wrote:
> In <402ac982-0750-4977-adb2-602b19149...@m24g2000prn.googlegroups.com> Jonathan Gardner <jgard...@jonathangardner.net> writes:
<huge snip>
>>It sounds like someone, probably beautiful soup, is trying to turn
>>your strings into unicode. A full stacktrace would be useful to see
>>who did what where.
>
> Unfortunately, there's not much in the stacktrace:
>
> Traceback (most recent call last):
>   File "./download_tt.py", line 427, in <module>
>     x = "%s %s" % (table['id'], table.tr.renderContents())
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 41: ordinal not in range(128)

Think I've found the problem. According to the BeautifulSoup docs, renderContents() returns a str (i.e. a byte sequence, UTF-8-encoded by default) as opposed to unicode (i.e. an abstract code point sequence). Thus, as was said previously, you're combining the unicode from table['id'] and the str from renderContents(), so Python tries to implicitly convert the str to unicode by decoding it as ASCII. However, it's not ASCII but UTF-8, hence you get the error about it having non-ASCII bytes.

Solution: Convert the output of renderContents() back to unicode.

    x = u"%s %s" % (table['id'], table.tr.renderContents().decode('utf8'))

Now only unicode objects are being combined.

Your problem is particularly ironic considering how well BeautifulSoup handles Unicode overall; I was unable to locate a renderContents() equivalent that returned unicode.

Cheers,
Chris
--
http://blog.rebertia.com
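
P.S. To see the failure in isolation, here is a minimal sketch. The values are made-up stand-ins (no BeautifulSoup involved), and the ASCII decode that Python 2 performs implicitly when mixing unicode and str is written out explicitly, so the snippet also runs on Python 3, where that implicit decode no longer happens:

```python
# Hypothetical stand-ins for the two values in the failing line.
table_id = u"results"                                 # text, like table['id']
rendered = u"<td>1\u00a0000</td>".encode("utf-8")     # UTF-8 bytes, like renderContents()

# What Python 2 attempts behind the scenes: decode the bytes as ASCII.
# The non-breaking space was encoded as the bytes 0xc2 0xa0, so this fails.
try:
    rendered.decode("ascii")
    failed = False
    message = ""
except UnicodeDecodeError as exc:
    failed = True
    message = str(exc)

# The fix: decode explicitly with the encoding the bytes are really in,
# so that only text objects are combined.
x = u"%s %s" % (table_id, rendered.decode("utf8"))
```

Here `failed` ends up true and `message` mentions byte 0xc2, while `x` is an ordinary text string with the non-breaking space intact.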