Re: character encoding conversion

"Martin v. Löwis" Mon, 13 Dec 2004 15:05:05 -0800

Christian Ergh wrote:

Once more, indentation should be correct now, and the 128 is gone too. So, something like this?


Yes, something like this. The tricky part is, of course, the
fragments which you didn't implement.

Also, it might be possible to do this in a for loop, e.g.

for encoding in (pageencoding, xmlencoding, htmlmetaencoding,
                 "UTF-8", "Latin-1-no-controls", "cp1252", "Latin-1"):
    try:
        # decode the raw bytes; a failure means this guess was wrong
        data = data.decode(encoding)
        break
    except UnicodeError:
        pass

You then just need to add the Latin-1-no-controls codec, or you need
to special-case this in the loop.
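
A rough sketch of the special-casing variant, written for Python 3 bytes objects. The function name decode_with_fallbacks is made up for illustration, and it assumes that "Latin-1-no-controls" means Latin-1 with the C1 control range 0x80-0x9F rejected; that reading is a guess and the range may need adjusting.

```python
def decode_with_fallbacks(data, candidates):
    """Return (text, encoding) for the first candidate that decodes data.

    Hypothetical helper: "Latin-1-no-controls" is not a registered codec,
    so it is special-cased here instead of being looked up.
    """
    for encoding in candidates:
        if not encoding:
            continue  # e.g. no charset was declared by this source
        try:
            if encoding == "Latin-1-no-controls":
                # Assumption: Latin-1, but bytes 0x80-0x9F count as an error
                text = data.decode("latin-1")
                if any(0x80 <= ord(ch) <= 0x9F for ch in text):
                    raise UnicodeError("C1 control characters present")
            else:
                text = data.decode(encoding)
        except (UnicodeError, LookupError):
            # wrong guess, or a codec name Python does not know
            continue
        return text, encoding
    raise UnicodeError("no candidate encoding matched")
```

Catching LookupError as well means a bogus encoding name from the page is skipped rather than aborting the loop. Since plain Latin-1 accepts every byte sequence, keeping it last makes it the catch-all.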

# if it is not in the pagecode, how do i get the encoding of the page?
pageencoding = '???'


You need to remember the HTTP connection that you got the HTML file
from. The webserver may have sent a Content-Type header.
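
As an illustration only: if you kept the value of the Content-Type header, the standard library's email.message.Message can pull the charset parameter out of it. The helper name charset_from_content_type is hypothetical, and how you get at the header depends on the HTTP library you used.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset named in a Content-Type header value, or None."""
    if not header_value:
        return None  # the server sent no Content-Type header at all
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the name; None if no charset parameter
    return msg.get_content_charset()
```

The result (possibly None) would then serve as pageencoding.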

xmlencoding  = 'whatever i parsed out of the file'
htmlmetaencoding = 'whatever i parsed out of the metatag'


Depending on the library you use, these aren't that trivial, either.
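
If no parser gives you these values, a crude regular-expression sniffer might look like the following. This is only a heuristic sketch: the function name sniff_declared_encodings and the 4096-byte limit are arbitrary choices, it assumes the document is in an ASCII-compatible encoding, and it will be fooled by things like declarations inside comments.

```python
import re

# Encoding pseudo-attribute of an XML declaration at the very start of the data
XML_DECL_RE = re.compile(
    rb'^\s*<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._:-]+)["\']')
# charset=... anywhere inside a <meta> tag (covers http-equiv and charset forms)
META_CHARSET_RE = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.IGNORECASE)


def sniff_declared_encodings(data, limit=4096):
    """Return (xmlencoding, htmlmetaencoding); each is a str or None."""
    head = data[:limit]  # declarations should appear near the top
    xml = XML_DECL_RE.search(head)
    meta = META_CHARSET_RE.search(head)
    xmlencoding = xml.group(1).decode("ascii") if xml else None
    htmlmetaencoding = meta.group(1).decode("ascii") if meta else None
    return xmlencoding, htmlmetaencoding
```

Either value may name an encoding Python does not know, or simply be wrong, so it should still go through the try/except loop rather than being trusted outright.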

Regards,
Martin