On Fri, Nov 16, 2012 at 2:44 PM, <b...@yelp.com> wrote:
> Latin1 has a block of 32 undefined characters.

These characters are not undefined. 0x80-0x9f are the C1 control codes
in Latin-1, much as 0x00-0x1f are the C0 control codes, and their
Unicode mappings are well defined.

http://tools.ietf.org/html/rfc1345

> Windows-1252 (aka cp1252) fills in 27 of these characters but leaves five
> undefined: 0x81, 0x8D, 0x8F, 0x90, 0x9D

In CP 1252, these codes are actually undefined.

http://msdn.microsoft.com/en-us/goglobal/cc305145.aspx

> Also, the html5 standard says:
>
> When a user agent [browser] would otherwise use a character encoding given in
> the first column [ISO-8859-1, aka latin1] of the following table to either
> convert content to Unicode characters or convert Unicode characters to bytes,
> it must instead use the encoding given in the cell in the second column of
> the same row [windows-1252, aka cp1252].
>
> http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#character-encodings-0
>
>
> The current implementation of windows-1252 isn't usable for this purpose (a
> replacement of latin1), since it will throw an error in cases that latin1
> would succeed.

You can use a non-strict error handling scheme to prevent the error.

>>> b'hello \x81 world'.decode('cp1252')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python33\lib\encodings\cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 6: character maps to <undefined>
>>> b'hello \x81 world'.decode('cp1252', 'replace')
'hello \ufffd world'
>>> b'hello \x81 world'.decode('cp1252', 'ignore')
'hello  world'
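
If the goal is a stand-in for latin1 that never fails, neither 'replace' nor 'ignore' preserves the original byte. One possible approach is a custom error handler that decodes the five undefined bytes as Latin-1, so they come out as the corresponding C1 control characters. This is only a minimal sketch: the handler name `latin1_fallback` and the helper `decode_cp1252_lenient` are made up for illustration and are not part of the standard library; `codecs.register_error` is the real hook.

```python
import codecs


def latin1_fallback(exc):
    """Decode bytes that cp1252 leaves undefined as Latin-1 instead.

    Hypothetical handler for illustration; only decoding is handled.
    """
    if isinstance(exc, UnicodeDecodeError):
        # exc.object is the full input; start/end delimit the offending bytes.
        bad = exc.object[exc.start:exc.end]
        # Latin-1 maps every byte to the code point of the same value,
        # so 0x81 becomes U+0081 (a C1 control character).
        return bad.decode('latin-1'), exc.end
    raise exc


# The name 'latin1_fallback' is an arbitrary choice, not a built-in scheme.
codecs.register_error('latin1_fallback', latin1_fallback)


def decode_cp1252_lenient(data):
    """Decode as cp1252, falling back to Latin-1 for undefined bytes."""
    return data.decode('cp1252', 'latin1_fallback')
```

With this registered, decode_cp1252_lenient(b'hello \x81 world') gives 'hello \x81 world', while bytes that cp1252 does define, such as 0x80, still decode to their cp1252 characters.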