On Fri, Nov 16, 2012 at 2:44 PM,  <b...@yelp.com> wrote:
> Latin1 has a block of 32 undefined characters.

These characters are not undefined.  0x80-0x9f are the C1 control
codes in Latin-1, much as 0x00-0x1f are the C0 control codes, and
their Unicode mappings are well defined.

http://tools.ietf.org/html/rfc1345
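
A quick way to check this from Python (a minimal sketch; the variable names are only for illustration): every byte in 0x80-0x9f decodes under Latin-1 to the code point with the same number, and Unicode classifies each of those code points as a control character.

```python
import unicodedata

# The 32 bytes in question, 0x80 through 0x9f inclusive.
c1_bytes = bytes(range(0x80, 0xA0))

# Latin-1 maps each byte value straight to the same code point.
decoded = c1_bytes.decode('latin-1')
code_points = [ord(ch) for ch in decoded]

# Collect the Unicode general category of every decoded character.
categories = {unicodedata.category(ch) for ch in decoded}

print(code_points == list(range(0x80, 0xA0)))
print(categories)
```

The decode never raises, which is the sense in which these bytes are defined in Latin-1.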

> Windows-1252 (aka cp1252) fills in 27 of these characters but leaves five 
> undefined: 0x81, 0x8D, 0x8F, 0x90, 0x9D

In CP 1252, on the other hand, these five codes really are undefined.

http://msdn.microsoft.com/en-us/goglobal/cc305145.aspx
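
You can confirm which bytes Python's own cp1252 codec rejects by trying all 256 values. This is only a sketch against the codec shipped with CPython, and the name `undefined` is mine:

```python
# Try to decode every possible byte value with the strict cp1252 codec
# and record the ones that raise.
undefined = []
for value in range(256):
    try:
        bytes([value]).decode('cp1252')
    except UnicodeDecodeError:
        undefined.append(value)

print([hex(v) for v in undefined])
```

The list should match the five bytes quoted above.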

> Also, the html5 standard says:
>
> When a user agent [browser] would otherwise use a character encoding given in 
> the first column [ISO-8859-1, aka latin1] of the following table to either 
> convert content to Unicode characters or convert Unicode characters to bytes, 
> it must instead use the encoding given in the cell in the second column of 
> the same row [windows-1252, aka cp1252].
>
> http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#character-encodings-0
>
>
> The current implementation of windows-1252 isn't usable for this purpose (a 
> replacement of latin1), since it will throw an error in cases that latin1 
> would succeed.

You can pass a non-strict error handler such as 'replace' or 'ignore' as
the second argument to decode() to prevent the error, at the cost of
losing the original byte.

>>> b'hello \x81 world'.decode('cp1252')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python33\lib\encodings\cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 6: character maps to <undefined>

>>> b'hello \x81 world'.decode('cp1252', 'replace')
'hello \ufffd world'
>>> b'hello \x81 world'.decode('cp1252', 'ignore')
'hello  world'
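
If you want cp1252 to behave as a drop-in replacement for latin1, one option is to register your own error handler that passes the undefined bytes through as the corresponding C1 control codes instead of dropping them. The following is a rough sketch, not something the standard library provides: the handler name 'c1_passthrough' is made up for this example, and the assumption is that falling back to the Latin-1 mapping is what you want for those five bytes.

```python
import codecs

def c1_passthrough(exc):
    # Hypothetical handler: only decoding errors are handled here.
    if isinstance(exc, UnicodeDecodeError):
        # Decode the offending bytes as Latin-1, which maps each byte
        # to the code point with the same value (a C1 control code).
        bad = exc.object[exc.start:exc.end]
        return bad.decode('latin-1'), exc.end
    raise exc

# Make the handler available under a name usable with decode().
codecs.register_error('c1_passthrough', c1_passthrough)

text = b'hello \x81 world'.decode('cp1252', 'c1_passthrough')
mixed = b'\x80\x81'.decode('cp1252', 'c1_passthrough')

print(repr(text))
print(repr(mixed))
```

Bytes that cp1252 does define, such as 0x80 for the euro sign, still decode normally; only the five undefined bytes reach the handler.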
