Tim Arnold wrote:
> Hi, I'm getting the by-now-familiar error:
>     return codecs.charmap_decode(input,errors,decoding_map)
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position
> 4615: ordinal not in range(128)
>
> the html file I'm working with is in utf-8, I open it with codecs, try to
> feed it to TidyHTMLTreeBuilder, but no luck. Here's my code:
> from elementtree import ElementTree as ET
> from elementtidy import TidyHTMLTreeBuilder
>
> fd = codecs.open(htmfile,encoding='utf-8')
> tidyTree = TidyHTMLTreeBuilder.TidyHTMLTreeBuilder(encoding='utf-8')
> tidyTree.feed(fd.read())
> self.tree = tidyTree.close()
> fd.close()
>
> what am I doing wrong? Thanks in advance.

Being too clever for your own good, sorry to say.

TidyHTMLTreeBuilder takes the encoding for a reason: it expects a byte string that it will decode itself. But you decode first, creating a unicode object. When you pass that to the feed method, which expects a byte string, Python attempts a conversion to a byte string using the default encoding (ascii, as the traceback shows), and that fails on u'\xa9'.

Opening the file with plain open() instead of codecs.open() should do the trick.

diez
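
To make the mechanism concrete, here is a minimal sketch that does not import elementtidy at all. It only shows the two halves of the explanation: a plain binary read hands back undecoded bytes, and forcing already-decoded text through the ascii codec fails on the copyright sign. The sample HTML and the helper names (write_sample, read_raw, implicit_ascii_fails) are made up for illustration. With the real library, the corresponding change would presumably be to pass the result of open(htmfile, 'rb').read() to tidyTree.feed().

```python
import os
import tempfile

# Hypothetical stand-in for the HTML file from the question: it contains
# the copyright sign U+00A9 and is stored on disk as UTF-8.
html_text = u"<html><body><p>\u00a9 2008</p></body></html>"


def write_sample(path):
    # Store the sample as UTF-8 bytes, like the original file.
    with open(path, "wb") as f:
        f.write(html_text.encode("utf-8"))


def read_raw(path):
    # Plain binary read: nothing is decoded, so a parser that was given
    # encoding='utf-8' can do the decoding itself.
    with open(path, "rb") as f:
        return f.read()


def implicit_ascii_fails(text):
    # Roughly what Python 2 does when a unicode object is passed where a
    # byte string is expected: encode with the default codec (ascii).
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return True
    return False


fd, sample_path = tempfile.mkstemp(suffix=".html")
os.close(fd)
try:
    write_sample(sample_path)
    raw = read_raw(sample_path)
finally:
    os.remove(sample_path)

# Decoding first (what codecs.open does) yields text that can no longer
# be turned back into bytes with the ascii codec.
decoded = raw.decode("utf-8")
print(implicit_ascii_fails(decoded))
```

The point of the design is to decode exactly once: either the caller decodes or the parser does, not both.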