Ezio Melotti <ezio.melo...@gmail.com> added the comment: This new patch (v3) should be ok. I added a few more tests and found another corner case: '\xe1a'.decode('utf-8', 'replace') was returning u'\ufffd' because \xe1 is the start byte of a 3-byte sequence and there were only two bytes in the string. This is now fixed in the latest patch.
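For reference, the corner case can be reproduced with a bytes literal under Python 3 syntax (the original snippet is Python 2); the intended behavior is that only the truncated sequence becomes a replacement character, while the trailing ASCII byte still decodes:

```python
# \xe1 opens a 3-byte UTF-8 sequence, but b"a" (0x61) is not a valid
# continuation byte, so the decoder should emit one U+FFFD for the
# truncated sequence and then decode the "a" normally.
result = b"\xe1a".decode("utf-8", "replace")
print(repr(result))  # '\ufffda', not just '\ufffd'
```

The buggy behavior was that the decoder consumed the "a" as part of the malformed sequence, losing it entirely.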
I also unrolled all the loops except the first one, because I haven't found an elegant way to unroll it (yet). Finally, I changed the error messages to make them clearer: "unexpected code byte" -> "invalid start byte"; "invalid data" -> "invalid continuation byte". (I can revert this if the old messages are better, or if it is better to fix this in a separate commit.)

The performance seems more or less the same; I ran some benchmarks without seeing significant changes in the results. If you have better benchmarks, let me know. I used 320kB files with some plain ASCII, ASCII mixed with some accented characters, Japanese, and a file with a sample of several different Unicode chars.

----------
Added file: http://bugs.python.org/file16754/issue8271v3.diff

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue8271>
_______________________________________