Re: character encoding conversion

Christian Ergh Sun, 12 Dec 2004 12:14:35 -0800

Martin v. Löwis wrote:

Dylan wrote:

Things I have tried include encode()/decode()

This should work. If you somehow manage to guess the encoding,
e.g. guess it as cp1252, then

  htmlstring.decode("cp1252").encode("us-ascii", "xmlcharrefreplace")

will give you a file that contains only ASCII characters, and
character references for everything else.
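
As a small illustration of the one-liner above, here is a sketch written for Python 3, where the input has to be a bytes object (the str.decode() spelling in the thread is Python 2). The sample string and the names html_bytes, text and ascii_bytes are made up for the example.

```python
# Hypothetical sample input: bytes as they might arrive from a cp1252 web page.
html_bytes = "Grüße, 10 €".encode("cp1252")

# Decode with the guessed encoding, then re-encode as pure ASCII.
# Anything outside ASCII becomes a numeric character reference.
text = html_bytes.decode("cp1252")
ascii_bytes = text.encode("us-ascii", "xmlcharrefreplace")

print(ascii_bytes)  # b'Gr&#252;&#223;e, 10 &#8364;'
```

The result contains only ASCII bytes, so it can be written out without caring about the file's encoding any further.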

Now, how should you guess the encoding? Here is a strategy:
1. use the encoding that was sent through the HTTP header. Be
   absolutely certain to not ignore this encoding.
2. use the encoding in the XML declaration (if any).
3. use the encoding in the http-equiv meta element (if any)
4. use UTF-8
5. use Latin-1, and check that there are no characters in the
   range(128,160)
6. use cp1252
7. use Latin-1

In the order from 1 to 6, check whether you manage to decode
the input. Notice that in step 5, you will definitely get successful
decoding; consider this a failure if you get any control
characters (from range(128, 160)); then try in step 7 latin-1
again.

When you find the first encoding that decodes correctly, encode
it with ascii and xmlcharrefreplace, and you won't need to worry
about the encoding, anymore.
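
The seven steps could be sketched roughly as follows, again in Python 3 with bytes input. This is only one possible reading of the strategy, not a tested library routine: the function names decode_html, has_c1_controls and to_ascii are invented here, and the three charset arguments are assumed to have been extracted already from the HTTP header, the XML declaration and the meta element (that extraction is not shown).

```python
def has_c1_controls(text):
    """Return True if text contains a character in range(128, 160)."""
    return any(128 <= ord(ch) < 160 for ch in text)


def decode_html(data, http_charset=None, xml_charset=None, meta_charset=None):
    """Try the candidate encodings in order; return (text, encoding_used).

    The charset arguments are hypothetical inputs; pass None when the
    page gives no such information.
    """
    # (encoding name, reject result if it contains C1 control characters)
    candidates = [
        (http_charset, False),  # step 1
        (xml_charset, False),   # step 2
        (meta_charset, False),  # step 3
        ("utf-8", False),       # step 4
        ("latin-1", True),      # step 5: always decodes, so check the result
        ("cp1252", False),      # step 6
    ]
    for name, reject_c1 in candidates:
        if not name:
            continue
        try:
            text = data.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess, or an encoding name Python does not know.
            continue
        if reject_c1 and has_c1_controls(text):
            continue
        return text, name
    # Step 7: Latin-1 never fails, so use it as the last resort.
    return data.decode("latin-1"), "latin-1"


def to_ascii(text):
    """Encode as ASCII, with character references for everything else."""
    return text.encode("us-ascii", "xmlcharrefreplace")


text, used = decode_html("äöüÄÖÜß".encode("latin-1"))
print(used, to_ascii(text))
```

With no header information, the Latin-1 bytes in the usage line fail the UTF-8 attempt, pass step 5, and come out as ASCII with one character reference per umlaut.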

Regards,
Martin

I have a similar problem, with characters like äöüAÖÜß and so on. I am
extracting some content out of webpages, and they deliver whatever,
sometimes not even giving any encoding information in the header.

But your solution sounds quite good, I just do not know
- if it works with the characters I mentioned
- what encoding do you have in the end
- and how exactly are you doing all this? All with somestring.decode() or...

Can you please give an example for these 7 steps?

Thanks in advance for the help
Chris
