In article <[EMAIL PROTECTED]>, "Diez B. Roggisch" <[EMAIL PROTECTED]> wrote:
> > I started out working in the context of elementtidy, but now I am
> > running into trouble in general Python-XML areas, so I thought I'd toss
> > the question out here. The code below is fairly self-explanatory. I have
> > a small HTML snippet that is UTF-8 encoded and is not 7-bit ASCII
> > compatible. I use Tidy to convert it to XHTML, and this particular setup
> > returns a unicode instance rather than a string.
> >
> > import _elementtidy as et
> > from xml.parsers import expat
> >
> > data = unicode(open("snippetWithUnicode.html").read(), "utf-8")
> > html = et.fixup(data)[0]
> > parser = expat.ParserCreate()
> > parser.Parse(html)
> >
> > UnicodeEncodeError: 'ascii' codec can't encode character '\ub5' in
> > position 542: ordinal not in range(128)
> >
> > If I set my default encoding to utf8 in sitecustomize.py, it works just
> > fine. I'm thinking that I can't be the only one trying to pass unicode
> > to expat... Is there something else I need to do here?
>
> you confuse unicode with utf8. Expat can parse the latter - the former is
> internal to python. And passing it to something that needs a string will
> result in a conversion - which fails because of the ascii encoding.
>
> Do this:
>
> parser.Parse(html.encode('utf-8'))

Possibly preceded by

    parser = expat.ParserCreate('utf-8')

..so there's no confusion with the declared encoding, in case that's not
utf-8.

Just

-- 
http://mail.python.org/mailman/listinfo/python-list
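
For reference, the two suggestions above (encode explicitly, and create the parser with an encoding override) can be combined in one small sketch. This is only an illustration, not anyone's actual code from the thread: it assumes Python 3, where text and bytes are separate types, and the helper name parse_text_chunks and the sample markup are made up, standing in for the Tidy output.

```python
from xml.parsers import expat


def parse_text_chunks(markup):
    """Encode a text string to UTF-8 bytes, parse it, and return its character data.

    Hypothetical helper for illustration only.
    """
    chunks = []
    # Passing 'utf-8' overrides whatever encoding the document itself declares.
    parser = expat.ParserCreate('utf-8')
    # expat may deliver character data in several pieces, so collect them all.
    parser.CharacterDataHandler = chunks.append
    # Encode explicitly so expat receives bytes and no implicit conversion happens.
    parser.Parse(markup.encode('utf-8'), True)
    return ''.join(chunks)


# Made-up stand-in for the Tidy output; \u00b5 is the micro sign.
html = '<p>10 \u00b5m</p>'
text = parse_text_chunks(html)
```

Because the parser is created with 'utf-8', a document that declares some other encoding in its XML declaration is still decoded as UTF-8, which matches the bytes actually being passed in.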