[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

Marc-Andre Lemburg Wed, 31 Mar 2010 11:15:10 -0700

Marc-Andre Lemburg <m...@egenix.com> added the comment:

I guess the term "failing" byte is somewhat underdefined.


Page 95 of the standard PDF 
(http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf) suggests that a 
decoder should "Replace each maximal subpart of an ill-formed subsequence by 
a single U+FFFD".

Fortunately, they explain what they are after: if a subsequent byte in the 
sequence does not have the high bit set, it's not to be considered part of the 
UTF-8 sequence of the code point.

Implementing that should be fairly straightforward by adjusting the endinpos 
variable accordingly.
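
For illustration, here is a rough pure-Python sketch of that idea. It is not 
the actual C decoder: the function name decode_utf8_replace and the variable 
end are made up for this example, with end playing the role that endinpos 
would play. It assumes the well-formed byte ranges from the standard's UTF-8 
table and stops extending a sequence at the first byte that cannot continue it.

```python
def decode_utf8_replace(data: bytes) -> str:
    """Decode UTF-8, replacing each maximal ill-formed subpart by U+FFFD.

    Illustrative sketch only; names and structure are hypothetical.
    """
    out = []
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            # Plain ASCII byte.
            out.append(chr(b))
            i += 1
            continue
        # Number of continuation bytes needed, and the valid range for the
        # first continuation byte (later ones are always 0x80..0xBF).
        if 0xC2 <= b <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif b == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF   # excludes overlong forms
        elif b == 0xED:
            need, lo, hi = 2, 0x80, 0x9F   # excludes surrogates
        elif 0xE1 <= b <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF   # excludes overlong forms
        elif 0xF1 <= b <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F   # excludes values above U+10FFFF
        else:
            # Stray continuation byte or a byte that can never start a
            # sequence: it is replaced on its own.
            out.append("\ufffd")
            i += 1
            continue
        cp = b & (0x7F >> (need + 1))      # payload bits of the lead byte
        end = i + 1                        # analogous to endinpos
        ok = True
        for _ in range(need):
            if end < n and lo <= data[end] <= hi:
                cp = (cp << 6) | (data[end] & 0x3F)
                end += 1
                lo, hi = 0x80, 0xBF
            else:
                # The byte at `end` is not part of this sequence, so it is
                # left in place and decoded again on the next iteration.
                ok = False
                break
        out.append(chr(cp) if ok else "\ufffd")
        i = end
    return "".join(out)


sample = bytes.fromhex("61 F1 80 80 E1 80 C2 62 80 63 80 BF 64")
print(ascii(decode_utf8_replace(sample)))
```

The point of the sketch is that a byte which cannot continue the current 
sequence ends the replaced span without being consumed, so in the sample the 
"b" after C2 survives instead of being swallowed by the replacement.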

Any takers?

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue8271>
_______________________________________
