[issue9133] Invalid UTF8 Byte sequence not raising exception/being substituted

Mike Lewis Wed, 30 Jun 2010 13:22:32 -0700

New submission from Mike Lewis <[email protected]>:

When I run

codecs.encode(codecs.decode('\xed\xbc\xad', 'utf8'), 'utf8')

it does not raise an exception, even though '\xed\xbc\xad' is an invalid UTF-8 byte sequence.

It maps to the value U+DF2D, which lies in the surrogate range (one half of a "surrogate pair"), it seems.
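
The mapping to U+DF2D can be checked by hand. The following is only a sketch, not the codec's actual implementation: the helper name decode_three_byte_utf8 is made up for illustration, and it simply applies the 3-byte UTF-8 bit layout (1110xxxx 10xxxxxx 10xxxxxx) without any of the validity checks a real decoder should perform.

```python
def decode_three_byte_utf8(seq):
    """Hypothetical helper: unpack a 3-byte UTF-8 pattern by hand.

    Deliberately performs no surrogate or overlong checks, so it shows
    what the raw bit layout yields, not what a strict decoder accepts.
    """
    b0, b1, b2 = bytearray(seq)
    # Lead byte must look like 1110xxxx, continuation bytes like 10xxxxxx.
    if (b0 & 0xF0) != 0xE0 or (b1 & 0xC0) != 0x80 or (b2 & 0xC0) != 0x80:
        raise ValueError("not a 3-byte UTF-8 pattern")
    # 4 payload bits from the lead byte, 6 from each continuation byte.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


code_point = decode_three_byte_utf8(b'\xed\xbc\xad')
print("U+%04X" % code_point)            # U+DF2D
print(0xD800 <= code_point <= 0xDFFF)   # True: inside the surrogate range
```

The payload bits are 0xD, 0x3C and 0x2D, which combine to 0xDF2D.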

http://tools.ietf.org/html/rfc3629#section-4

explains:

      However, pairs of
      UCS-2 values between D800 and DFFF (surrogate pairs in Unicode
      parlance), being actually UCS-4 characters transformed through
      UTF-16, need special treatment: the UTF-16 transformation must be
      undone, yielding a UCS-4 character that is then transformed as
      above.

which would suggest that it is invalid.

However, I think Wikipedia's explanation is a bit clearer:

UTF-8 may only legally be used to encode valid Unicode scalar values. According 
to the Unicode standard the high and low surrogate halves used by UTF-16 
(U+D800 through U+DFFF) and values above U+10FFFF are not legal Unicode values, 
and the UTF-8 encoding of them is an invalid byte sequence and should be 
treated as described above.
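
The rule quoted above can be written as a small predicate. This is a sketch under assumptions: the function names are hypothetical, and the second helper assumes a Python 3 interpreter, whose strict UTF-8 codec rejects encoded surrogates by default. It is not a description of how Python 2.6 behaves, which is the version this report is about.

```python
def is_unicode_scalar_value(code_point):
    """Hypothetical check: True for U+0000..U+10FFFF minus the surrogates."""
    in_range = 0 <= code_point <= 0x10FFFF
    is_surrogate = 0xD800 <= code_point <= 0xDFFF
    return in_range and not is_surrogate


def strict_utf8_decode_fails(data):
    """Return True if decoding `data` as strict UTF-8 raises an error."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return True
    return False


print(is_unicode_scalar_value(0xDF2D))            # False
print(strict_utf8_decode_fails(b'\xed\xbc\xad'))  # True on Python 3
```

Under this rule U+DF2D is not a scalar value, so a decoder following the quoted text would be expected to raise an error or substitute a replacement character for these bytes.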


Thanks,
Mike

----------
components: Unicode
messages: 109010
nosy: Mike.Lewis
priority: normal
severity: normal
status: open
title: Invalid UTF8 Byte sequence not raising exception/being substituted
versions: Python 2.6

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue9133>
_______________________________________