On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller
<kurt.alfred.muel...@gmail.com> wrote:
> $ wget -q -O - http://python.org/ | chardetect.py
> stdin: ISO-8859-2 with confidence 0.803579722043
> $

And it sucks, because it relies on magic instead of reading the HTML
tags. The RIGHT thing to do for websites is to detect the meta charset
definition, which is

    <meta http-equiv="content-type" content="text/html; charset=utf-8">

or

    <meta charset="utf-8">

The second one is for HTML5 websites. Both need case-insensitive
matching and may carry the useless ` /` at the end. But if somebody is
using HTML5, you are pretty much guaranteed to get UTF-8.

In today’s world, the proper assumption to make is “UTF-8 or GTFO”,
because nobody in their right mind would use anything else today.

-- 
Kwpolska <http://kwpolska.tk>
stop html mail | always bottom-post
www.asciiribbon.org | www.netmeister.org/news/learn2quote.html
GPG KEY: 5EAAEA16
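
For illustration, here is a minimal sketch of the meta-charset lookup described in the reply, assuming Python 3 and only the standard library. The names `MetaCharsetParser` and `detect_meta_charset` are made up for this example. The 4096-byte scan window and the UTF-8 fallback are assumptions too, not something chardet or any other library does.

```python
import re
from html.parser import HTMLParser

# Pulls the charset out of a value like "text/html; charset=utf-8".
_CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([^\s\"';]+)", re.IGNORECASE)


class MetaCharsetParser(HTMLParser):
    """Hypothetical helper: remembers the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, and routes
        # self-closing tags (<meta ... />) through this method as well.
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        if attrs.get("charset"):
            # HTML5 form: <meta charset="utf-8">
            self.charset = attrs["charset"].strip().lower()
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            # Older form: <meta http-equiv="content-type" content="...">
            match = _CHARSET_RE.search(attrs.get("content") or "")
            if match:
                self.charset = match.group(1).lower()


def detect_meta_charset(raw, default="utf-8"):
    """Return the charset declared in the page's <meta> tags, or `default`."""
    # The declaration itself is ASCII, so a lossy ASCII decode of the
    # first few kilobytes (assumed window: 4096 bytes) is enough to find it.
    text = raw[:4096].decode("ascii", errors="replace")
    parser = MetaCharsetParser()
    parser.feed(text)
    return parser.charset or default
```

Pass it the raw bytes of the page, for example the result of `urllib.request.urlopen(url).read()`, and decode the page with whatever name comes back.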