This is a follow-up to a blog post I wrote the other day: http://www.blueskyonmars.com/archives/2005/01/31/using_unicode_with_elementtidy.html
I started out working in the context of elementtidy, but now I am running into trouble in general Python-XML areas, so I thought I'd toss the question out here. The code below is fairly self-explanatory. I have a small HTML snippet that is UTF-8 encoded and is not 7-bit ASCII compatible. I use Tidy to convert it to XHTML, and this particular setup returns a unicode instance rather than a string.
import _elementtidy as et
from xml.parsers import expat

data = unicode(open("snippetWithUnicode.html").read(), "utf-8")
html = et.fixup(data)[0]
parser = expat.ParserCreate()
parser.Parse(html)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb5' in position 542: ordinal not in range(128)
If I set my default encoding to utf8 in sitecustomize.py, it works just fine. I'm thinking that I can't be the only one trying to pass unicode to expat... Is there something else I need to do here?
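One workaround that might avoid touching the default encoding is to encode the unicode object to UTF-8 bytes explicitly before handing it to expat, so that Python never falls back to the ascii codec. This is only a minimal sketch: the inline `html` value is a made-up stand-in for what et.fixup() returns, and the `collected` list and character data handler are there just to show that the non-ASCII character survives the round trip.

```python
from xml.parsers import expat

# Hypothetical stand-in for the unicode XHTML returned by et.fixup(data)[0].
html = u"<html><body><p>5 \xb5m</p></body></html>"

# Collect character data so we can check what the parser actually saw.
collected = []

# Tell expat the input bytes are UTF-8.
parser = expat.ParserCreate("utf-8")
parser.CharacterDataHandler = collected.append

# Encode explicitly: expat receives UTF-8 bytes rather than a unicode
# object that would be implicitly encoded with the default codec.
parser.Parse(html.encode("utf-8"), True)

# expat may deliver character data in several chunks, so join them.
text = u"".join(collected)
```

If this holds up, the same html.encode("utf-8") call should drop into the snippet above in place of passing html directly, without needing the sitecustomize.py change.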
Thanks,
Kevin
Blazing Things