On Thu, Jun 6, 2013 at 1:14 AM, iMath <redstone-c...@163.com> wrote:
> 在 2012年12月24日星期一UTC+8上午8时34分47秒,iMath写道:
>> how to detect the character encoding  in a web page ?
>>
>> such as this page
>>
>>
>>
>> http://python.org/
>
> by the way  ,we cannot get character encoding programmatically from the mate 
> data without knowing the  character encoding  ahead !

The rules for web pages are (massively oversimplified):

1) HTTP header
2) ASCII-compatible encoding and meta tag

The HTTP header is completely out of band. This is the best way to
transmit encoding information. Otherwise, you assume 7-bit ASCII and
start parsing. Once you find a meta tag, you stop parsing and go back
to the top, decoding in the new way. "ASCII-compatible" covers a huge
number of encodings, so it's not actually much of a problem to do
this.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list

Reply via email to