Martin P. Hellwig wrote:
> From what I remember, they used an algorithm to compute statistics for
> the specific page, compared those with statistics about all kinds of
> languages and encodings, and just picked the most likely match.
More hearsay: I believe language-based heuristics are common. You first guess an encoding based on the bytes you see, then guess the language of the page. If you then see a lot of characters that should not appear in text of that language (e.g. many umlaut characters in a French page), you know your guess was wrong, and you try a different language for that encoding. If you run out of languages, you guess a different encoding.

Mozilla can guess the encoding if you tell it what the language is, which sounds like a similar approach.

Regards,
Martin
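P.S. A minimal sketch of that guess-and-backtrack loop, purely for illustration. The candidate encodings, the "unlikely characters" tables, and the threshold are my own invented placeholders, not anything Mozilla actually uses:

```python
# Illustrative sketch only: the tables and threshold below are made up,
# not real detection data.
UNLIKELY = {
    # Characters that rarely appear in running text of each language.
    "french": set("\u00e4\u00f6\u00fc\u00df"),   # umlauts/eszett in "French" text => bad guess
    "german": set("\u00e0\u00e2\u00e7\u00ea\u00ee\u00f4\u00f9"),  # circumflexes etc. => bad guess
}

def unlikely_fraction(text, language):
    """Fraction of letters that should not appear in this language."""
    bad = UNLIKELY[language]
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c in bad for c in letters) / len(letters)

def guess(data, encodings=("utf-8", "latin-1"),
          languages=("french", "german"), threshold=0.05):
    for enc in encodings:                 # outer loop: try each encoding
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue                      # bytes impossible in this encoding
        for lang in languages:            # inner loop: try each language
            if unlikely_fraction(text, lang) < threshold:
                return enc, lang          # plausible pair found
    return None                           # ran out of guesses

# German text encoded as UTF-8: the French guess fails on the umlauts,
# so the loop backtracks and settles on German.
print(guess("Sch\u00f6ne Gr\u00fc\u00dfe".encode("utf-8")))
```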