Ezio Melotti <ezio.melo...@gmail.com> added the comment:

I've found a subtle corner case about 3- and 4-byte-long sequences.
For example, according to http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf 
(pages 94-95, Table 3-7), the sequences in the range \xe0\x80\x80-\xe0\x9f\xbf are 
invalid.
I.e., if the first byte is \xe0 and the second byte is between \x80 (included) 
and \xa0 (excluded), then the second byte is invalid. This is because sequences 
below \xe0\xa0\x80 would result in codepoints below U+0800, and those codepoints 
are already represented by 2-byte-long sequences (\xdf\xbf decodes to U+07FF).
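To make the boundary concrete, here is a minimal sketch (assuming a recent CPython, where overlong sequences are rejected) showing the smallest valid 3-byte sequence, the largest 2-byte sequence, and an overlong encoding being refused:

```python
# Per Unicode 5.2 Table 3-7, after the start byte \xe0 the second byte
# must be in \xa0-\xbf (not the usual \x80-\xbf); anything lower would
# be an overlong encoding of a codepoint that already has a 2-byte form.
assert b'\xe0\xa0\x80'.decode('utf-8') == '\u0800'  # smallest valid 3-byte seq
assert b'\xdf\xbf'.decode('utf-8') == '\u07ff'      # largest 2-byte sequence

try:
    b'\xe0\x9f\xbf'.decode('utf-8')  # overlong encoding of U+07FF
except UnicodeDecodeError as e:
    print('rejected:', e.reason)
```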

Assume that we want to decode the string b'\xe0\x61\x80\x61' (where \xe0 is the 
start byte of a 3-byte-long sequence, \x61 is the letter 'a', and \x80 is a valid 
continuation byte).
This actually results in:
>>> b'\xe0\x61\x80\x61'.decode('utf-8', 'replace')
'�a�a'
since \x61 is not a valid continuation byte in the sequence:
 * \xe0 is converted to �;
 * \x61 is displayed correctly as 'a';
 * \x80 is valid only as a continuation byte and invalid alone, so it's 
replaced by �;
 * \x61 is displayed correctly as 'a'.

Now, assume that we want to do the same with b'\xe0\x80\x81\x61'.
This actually results in:
>>> b'\xe0\x80\x81\x61'.decode('utf-8', 'replace')
'��a'
In this case \x80 would normally be a valid continuation byte, but since it's 
preceded by \xe0 it's not valid.
Since it's not valid, the result could arguably be similar to the previous case, 
i.e.:
 * \xe0 is converted to �;
 * \x80 is valid as a continuation byte but not in this specific position, so 
it's replaced by �;
 * \x81 is valid only as a continuation byte and invalid alone, so it's 
replaced by �;
 * \x61 is displayed correctly as 'a'.
However, in this case (and in the other similar cases), the invalid bytes 
wouldn't be otherwise valid, because they are still in the range \x80-\xbf 
(continuation bytes), so the current behavior might be fine.

This happens because the current algorithm just checks that the second byte 
(\x80) is in the range \x80-\xbf (i.e., that it's a continuation byte), and if it 
is, it assumes that the invalid byte is the third one (\x81) and replaces the 
first two bytes (\xe0\x80) with a single �.

That said, the algorithm could be improved to determine which byte is wrong with 
better accuracy (and that could also be used to give a better error message 
about decoding surrogates). This shouldn't affect the speed of regular 
decoding, because the extra check would happen only in case of error.
Also note that the Unicode standard doesn't seem to mention this case, and that 
anyway this doesn't "eat" any of the following characters as it did 
before the patch -- the only difference would be in the number of �.
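To illustrate what such an extra check could look like, here is a hypothetical helper (a sketch, not the actual patch; the function name and structure are my own): given a start byte, it returns the range of valid *second* bytes from Table 3-7, so the decoder could pinpoint exactly which byte is at fault instead of assuming it's the third one.

```python
def second_byte_range(start):
    """Return the inclusive (lo, hi) range of valid second bytes
    following the given start byte, per Unicode 5.2 Table 3-7.
    (Hypothetical helper, for illustration only.)"""
    if start == 0xE0:
        return (0xA0, 0xBF)  # exclude overlong 3-byte encodings
    if start == 0xED:
        return (0x80, 0x9F)  # exclude surrogates U+D800-U+DFFF
    if start == 0xF0:
        return (0x90, 0xBF)  # exclude overlong 4-byte encodings
    if start == 0xF4:
        return (0x80, 0x8F)  # exclude codepoints above U+10FFFF
    return (0x80, 0xBF)      # ordinary continuation-byte range

# For b'\xe0\x80\x81\x61' the second byte \x80 already falls outside
# the allowed range for \xe0, so the error can be reported right there.
lo, hi = second_byte_range(0xE0)
print(not (lo <= 0x80 <= hi))  # \x80 is invalid after \xe0
```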

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue8271>
_______________________________________