Martin P. Hellwig wrote:
> From what I remember, they used an algorithm to compute statistics for
> the specific page, compared those with statistics about all kinds of
> languages and encodings, and just picked the most likely match.
More hearsay: I believe language-based heuristics are common. You first guess an encoding based on the bytes you see, then guess the language of the page. If you then see a lot of characters that should not appear in text of that language (e.g. many umlaut characters in a French page), you know your guess was wrong, and you try a different language for that encoding. If you run out of languages, you guess a different encoding.

Mozilla can guess the encoding if you tell it what the language is, which sounds like a similar approach.

Regards,
Martin
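P.S. A minimal sketch of that guess-and-backtrack loop, purely for illustration. The candidate encodings, the "unlikely characters" tables, and the threshold are my own invented placeholders, not anything Mozilla actually uses:

```python
# Illustrative sketch only: the tables and threshold below are made up,
# not real detection data.
UNLIKELY = {
    # Characters that rarely appear in running text of each language.
    "french": set("\u00e4\u00f6\u00fc\u00df"),   # umlauts/eszett in "French" text => bad guess
    "german": set("\u00e0\u00e2\u00e7\u00ea\u00ee\u00f4\u00f9"),  # circumflexes etc. => bad guess
}

def unlikely_fraction(text, language):
    """Fraction of letters that should not appear in this language."""
    bad = UNLIKELY[language]
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c in bad for c in letters) / len(letters)

def guess(data, encodings=("utf-8", "latin-1"),
          languages=("french", "german"), threshold=0.05):
    for enc in encodings:                 # outer loop: try each encoding
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue                      # bytes impossible in this encoding
        for lang in languages:            # inner loop: try each language
            if unlikely_fraction(text, lang) < threshold:
                return enc, lang          # plausible pair found
    return None                           # ran out of guesses

# German text encoded as UTF-8: the French guess fails on the umlauts,
# so the loop backtracks and settles on German.
print(guess("Sch\u00f6ne Gr\u00fc\u00dfe".encode("utf-8")))
```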