I have a similar problem with characters like äöüAÖÜß and so on. I am extracting some content from web pages, and they deliver whatever they like, sometimes without even giving any encoding information in the header. Your solution sounds quite good, but I just do not know if (continued below the quoted message):

Dylan wrote:
Things I have tried include encode()/decode()
This should work. If you somehow manage to guess the encoding, e.g. guess it as cp1252, then
htmlstring.decode("cp1252").encode("us-ascii", "xmlcharrefreplace")
will give you a file that contains only ASCII characters, and character references for everything else.
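As a small illustration of that one-liner, here is a sketch in Python 3 syntax, where the raw page content is a bytes object. The sample bytes are made up for the example (the word "Grüße" as a cp1252 server would send it).

```python
# Hypothetical sample input: "Grüße" encoded in cp1252.
htmlstring = b"Gr\xfc\xdfe"

# Decode the bytes to text, then re-encode as pure ASCII. Every
# character outside ASCII becomes a numeric character reference.
ascii_html = htmlstring.decode("cp1252").encode("us-ascii", "xmlcharrefreplace")

print(ascii_html)  # b'Gr&#252;&#223;e'
```

The result is still valid HTML, because a browser resolves &#252; and &#223; back to ü and ß regardless of the encoding it assumes for the page.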
Now, how should you guess the encoding? Here is a strategy:
1. use the encoding that was sent through the HTTP header. Be absolutely certain to not ignore this encoding.
2. use the encoding in the XML declaration (if any).
3. use the encoding in the http-equiv meta element (if any)
4. use UTF-8
5. use Latin-1, and check that there are no characters in the range(128,160)
6. use cp1252
7. use Latin-1
Going through the steps in order from 1 to 6, check whether you manage to decode the input. Notice that in step 5 the decoding itself will always succeed; consider it a failure if you get any control characters (from range(128, 160)). Then, in step 7, try Latin-1 again, this time without that check.
When you find the first encoding that decodes correctly, encode the result with ascii and xmlcharrefreplace, and you won't need to worry about the encoding anymore.
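
The seven steps could be sketched roughly as follows. This is only a minimal Python 3 illustration, not a tested library: the names decode_html, has_c1_controls, XML_DECL and META_CHARSET are made up for the example, the two regular expressions are a crude stand-in for real XML/HTML parsing, and the caller is assumed to have already pulled the charset out of the HTTP Content-Type header (pass None if there was none).

```python
import re

# Simplistic patterns for steps 2 and 3 (assumption: good enough for a
# sketch; they are not a full XML or HTML parser).
XML_DECL = re.compile(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)', re.I)
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([A-Za-z0-9._-]+)', re.I)


def has_c1_controls(text):
    """True if text contains code points from range(128, 160)."""
    return any(128 <= ord(ch) < 160 for ch in text)


def decode_html(data, http_charset=None):
    """Try the seven steps in order; return (text, encoding_used)."""
    head = data[:2048]  # declarations normally sit near the top
    xml = XML_DECL.search(head)
    meta = META_CHARSET.search(head)
    # Each entry: (encoding name or None, reject control characters?)
    steps = [
        (http_charset, False),                                     # step 1
        (xml.group(1).decode("ascii") if xml else None, False),    # step 2
        (meta.group(1).decode("ascii") if meta else None, False),  # step 3
        ("utf-8", False),                                          # step 4
        ("latin-1", True),                                         # step 5
        ("cp1252", False),                                         # step 6
        ("latin-1", False),                                        # step 7
    ]
    for encoding, reject_c1 in steps:
        if not encoding:
            continue  # nothing was declared for this step
        try:
            text = data.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown codec name: next step
        if reject_c1 and has_c1_controls(text):
            continue  # decoded, but the result looks wrong for Latin-1
        return text, encoding
    raise AssertionError("unreachable: step 7 decodes any byte string")


# Usage: decode first, then make the result encoding-independent.
text, used = decode_html(b"Gr\xfc\xdfe")
ascii_html = text.encode("ascii", "xmlcharrefreplace")
```

Step 7 is reachable because Python's cp1252 codec leaves a few bytes (for example 0x81) undefined and raises an error for them, whereas Latin-1 maps every byte to some character.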
Regards, Martin
- it works with the characters I mentioned
- what encoding you have in the end
- and how exactly you are doing all this. Is it all done with somestring.decode(), or something else? Can you please give an example for these 7 steps?
Thanx in advance for the help
Chris
--
http://mail.python.org/mailman/listinfo/python-list