In article <[EMAIL PROTECTED]>,
 "Diez B. Roggisch" <[EMAIL PROTECTED]> wrote:

> > I started out working in the context of elementtidy, but now I am
> > running into trouble in general Python-XML areas, so I thought I'd toss
> > the question out here. The code below is fairly self-explanatory. I have
> > a small HTML snippet that is UTF-8 encoded and is not 7-bit ASCII
> > compatible. I use Tidy to convert it to XHTML, and this particular setup
> > returns a unicode instance rather than a string.
> > 
> > import _elementtidy as et
> > from xml.parsers import expat
> > 
> > data = unicode(open("snippetWithUnicode.html").read(), "utf-8")
> > html = et.fixup(data)[0]
> > parser = expat.ParserCreate()
> > parser.Parse(html)
> > 
> > UnicodeEncodeError: 'ascii' codec can't encode character u'\xb5' in
> > position 542: ordinal not in range(128)
> > 
> > If I set my default encoding to utf8 in sitecustomize.py, it works just
> > fine. I'm thinking that I can't be the only one trying to pass unicode
> > to expat... Is there something else I need to do here?
> 
> You are confusing unicode with UTF-8. Expat can parse the latter; the
> former is Python's internal representation. Passing a unicode object to
> something that needs a byte string triggers an implicit conversion, which
> fails because the default encoding is ASCII.
> 
> Do this:
> 
> parser.Parse(html.encode('utf-8'))

Possibly preceded by

  parser = expat.ParserCreate('utf-8')

..so the parser uses utf-8 and ignores the encoding declared in the
document itself, in case that's not utf-8.
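
Put together, the two suggestions might look roughly like the sketch below.
It is written for Python 3 (the code quoted above is Python 2), and the
helper name parse_text and the sample document are made up for illustration.
The sample deliberately declares iso-8859-1 to show that the encoding passed
to ParserCreate takes precedence over the declaration.

```python
from xml.parsers import expat


def parse_text(text):
    """Parse a text string by handing expat UTF-8 encoded bytes.

    parse_text is a hypothetical helper, not part of expat or elementtidy.
    Returns the concatenated character data of the document.
    """
    chunks = []
    # The encoding given here overrides whatever the document declares.
    parser = expat.ParserCreate("utf-8")
    # Character data may arrive in several pieces, so collect and join.
    parser.CharacterDataHandler = chunks.append
    # Encode explicitly so the parser receives bytes, not a text object.
    parser.Parse(text.encode("utf-8"), True)
    return "".join(chunks)


# Hypothetical sample: declares iso-8859-1 but is fed to expat as UTF-8.
sample = '<?xml version="1.0" encoding="iso-8859-1"?><p>5 \u00b5m</p>'
result = parse_text(sample)
```

Python 3's Parse() also accepts a text string directly, so the explicit
encode() is mostly there to make the byte/text distinction visible.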

Just
-- 
http://mail.python.org/mailman/listinfo/python-list
