On Thu, Jun 6, 2013 at 1:14 AM, iMath <redstone-c...@163.com> wrote:
> On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
>> how to detect the character encoding in a web page ?
>>
>> such as this page
>>
>> http://python.org/
>
> by the way ,we cannot get character encoding programmatically from the meta
> data without knowing the character encoding ahead !

The rules for web pages are (massively oversimplified):

1) HTTP header
2) ASCII-compatible encoding and meta tag

The HTTP header is completely out of band. This is the best way to
transmit encoding information. Otherwise, you assume 7-bit ASCII and
start parsing. Once you find a meta tag, you stop parsing and go back
to the top, decoding in the new way. "ASCII-compatible" covers a huge
number of encodings, so it's not actually much of a problem to do
this.

ChrisA
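
For illustration, here is a rough Python sketch of the two rules above. It is only a sketch under assumptions: the function name detect_encoding, the two regular expressions, the "ascii" fallback and the 1024-byte scan limit are all hypothetical choices made for this example, and real browsers apply considerably more rules than this.

```python
import re

# Hypothetical patterns for this sketch; real-world parsing is more involved.
_CHARSET_IN_HEADER = re.compile(r'charset\s*=\s*"?([\w.:-]+)', re.I)
_CHARSET_IN_META = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([\w.:-]+)', re.I)


def detect_encoding(content_type, body, default="ascii", scan_limit=1024):
    """Return the encoding name a page declares, or `default` if none.

    content_type: value of the HTTP Content-Type header (str or None).
    body: the raw, undecoded page bytes.
    scan_limit: how many leading bytes to search (an assumed cutoff).
    """
    # Rule 1: the HTTP header is out of band, so trust it first.
    if content_type:
        match = _CHARSET_IN_HEADER.search(content_type)
        if match:
            return match.group(1).lower()

    # Rule 2: assume an ASCII-compatible encoding and look for a meta tag
    # in the raw bytes. The tag itself only uses ASCII characters, so it
    # can be found before the real encoding is known.
    match = _CHARSET_IN_META.search(body[:scan_limit])
    if match:
        return match.group(1).decode("ascii").lower()

    # Nothing declared: fall back to the assumed default.
    return default


def decode_page(content_type, body):
    """Go back to the top and decode the whole body in the detected way."""
    return body.decode(detect_encoding(content_type, body), errors="replace")
```

Searching the bytes directly is what gets around the chicken-and-egg worry in the quoted question: in any ASCII-compatible encoding the bytes of `<meta ... charset=...>` are the same, so the declaration is readable before anything has been decoded.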