>>> '\x80'.decode('cp936')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 0: incomplete multibyte sequence

However:

Retrieved 2010-10-10 from
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT

#    Name:     cp936 to Unicode table
#    Unicode version: 2.0
#    Table version: 2.01
#    Table format:  Format A
#    Date:          1/7/2000
#
#    Contact:       shawn.ste...@microsoft.com
...
0x7F    0x007F  #DELETE
0x80    0x20AC  #EURO SIGN
0x81            #DBCS LEAD BYTE

Retrieved 2010-10-10 from
http://msdn.microsoft.com/en-us/goglobal/cc305153.aspx

Windows Codepage 936
[pictorial mapping; shows 80 mapping to 20AC]

Retrieved 2010-10-10 from
http://demo.icu-project.org/icu-bin/convexp?conv=windows-936-2000&s=ALL

[pictorial mapping for converter "windows-936-2000" with aliases
including GBK, CP936, MS936; shows 80 mapping to 20AC]

So Microsoft appears to think that cp936 includes the euro, and the
ICU project seems to think that GBK and cp936 both include the euro.

A couple of questions:

Is this a bug or a shrug?

Where can one find the mapping tables from which the various CJK
codecs are derived?
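
In the meantime, for anyone who needs 0x80 to come out as a euro sign, something along these lines might serve as a stopgap. It is only a sketch, written for Python 3 (bytes literals and bytes.decode) rather than the Python 2 session at the top. The handler name 'cp936_euro' and the helper decode_cp936 are made up for illustration and are not part of the standard library. It assumes the codec reports a decode error starting at the 0x80 byte, as in the traceback at the top.

```python
import codecs


def _euro_on_0x80(exc):
    """Error handler: map a stray 0x80 byte to U+20AC, re-raise anything else."""
    if (isinstance(exc, UnicodeDecodeError)
            and exc.object[exc.start:exc.start + 1] == b"\x80"):
        # Replace only the single offending byte and resume right after it.
        return ("\u20ac", exc.start + 1)
    raise exc


# 'cp936_euro' is a hypothetical handler name chosen for this sketch.
codecs.register_error("cp936_euro", _euro_on_0x80)


def decode_cp936(data):
    """Decode bytes as cp936, treating 0x80 as EURO SIGN per Microsoft's table."""
    return data.decode("cp936", errors="cp936_euro")
```

With this in place, decode_cp936(b"\x80") should give the euro sign instead of raising, while any other undecodable byte still raises UnicodeDecodeError. It papers over the discrepancy in calling code and does not change the codec's own mapping table.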