Ezio Melotti <ezio.melo...@gmail.com> added the comment:

After I wrote to the Unicode Consortium about the corner case I found, 
they updated the "Best Practices for Using U+FFFD" section[0], which now says:
"""
 Another example illustrates the application of the concept of maximal subpart 
for UTF-8 continuation bytes outside the allowable ranges defined in Table 3-7. 
The UTF-8 sequence <41 E0 9F 80 41> is ill-formed, because <9F> is not an 
allowed second byte of a UTF-8 sequence commencing with <E0>. In this case, 
there is an unconvertible offset at <E0> and the maximal subpart at that offset 
is also <E0>. The subsequence <E0 9F> cannot be a maximal subpart, because it 
is not an initial subsequence of any well-formed UTF-8 code unit sequence.
"""

The result of decoding that string with Python is:
>>> b'\x41\xE0\x9F\x80\x41'.decode('utf-8', 'replace')
'A��A'
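i.e. the bytes <E0 9F> are wrongly treated as a maximal subpart and replaced 
with a single '�' (the second '�' is the \x80). Following the updated text, 
<E0>, <9F> and <80> should each be replaced on their own, giving 'A���A'.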
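
To make the expected behaviour concrete, here is a minimal pure-Python sketch 
of a decoder that replaces each maximal subpart with one U+FFFD, using the byte 
ranges of Table 3-7. It is only an illustration and not the patch for this 
issue; the names second_byte_range and decode_maximal_subpart are made up for 
the example, and it assumes Python 3 (indexing bytes yields ints).

```python
def second_byte_range(lead):
    """Allowed range for the byte right after `lead` (Unicode Table 3-7)."""
    if lead == 0xE0:
        return 0xA0, 0xBF   # excludes overlong 3-byte forms
    if lead == 0xED:
        return 0x80, 0x9F   # excludes surrogates
    if lead == 0xF0:
        return 0x90, 0xBF   # excludes overlong 4-byte forms
    if lead == 0xF4:
        return 0x80, 0x8F   # excludes code points above U+10FFFF
    return 0x80, 0xBF


def decode_maximal_subpart(data):
    """Decode UTF-8 bytes, emitting one U+FFFD per maximal subpart."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                      # ASCII
            out.append(chr(b))
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:
            need = 1
        elif 0xE0 <= b <= 0xEF:
            need = 2
        elif 0xF0 <= b <= 0xF4:
            need = 3
        else:                             # stray continuation or invalid lead
            out.append('\ufffd')
            i += 1
            continue
        cp = b & (0x3F >> need)           # payload bits of the lead byte
        lo, hi = second_byte_range(b)
        j = i + 1
        for _ in range(need):
            if j >= n or not (lo <= data[j] <= hi):
                break                     # subpart ends before byte j
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
            lo, hi = 0x80, 0xBF           # later bytes use the generic range
        else:
            out.append(chr(cp))           # complete, well-formed sequence
            i = j
            continue
        out.append('\ufffd')              # one replacement for data[i:j]
        i = j                             # resume at the offending byte
    return ''.join(out)
```

With this sketch, decode_maximal_subpart(b'\x41\xE0\x9F\x80\x41') returns 
'A\ufffd\ufffd\ufffdA': <E0> is rejected alone because <9F> is outside 
A0..BF, and <9F> and <80> are then each replaced as stray continuation bytes.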

I'll work on a patch and see how it comes out.

[0]: http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf - page 96

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue8271>
_______________________________________