Ezio Melotti <ezio.melo...@gmail.com> added the comment: This new patch (v3) should be ok. I added a few more tests and found another corner case: '\xe1a'.decode('utf-8', 'replace') was returning u'\ufffd' because \xe1 is the start byte of a 3-byte sequence and there were only two bytes in the string. This is now fixed in the latest patch.
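For reference, the corner case can be reproduced with a bytes literal under Python 3 syntax (the original snippet is Python 2); the intended behavior is that only the truncated sequence becomes a replacement character, while the trailing ASCII byte still decodes:

```python
# \xe1 opens a 3-byte UTF-8 sequence, but b"a" (0x61) is not a valid
# continuation byte, so the decoder should emit one U+FFFD for the
# truncated sequence and then decode the "a" normally.
result = b"\xe1a".decode("utf-8", "replace")
print(repr(result))  # '\ufffda', not just '\ufffd'
```

The buggy behavior was that the decoder consumed the "a" as part of the malformed sequence, losing it entirely.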
I also unrolled all the loops except the first one, because I haven't found an elegant way to unroll it (yet). Finally, I changed the error messages to make them clearer: "unexpected code byte" -> "invalid start byte"; "invalid data" -> "invalid continuation byte". (I can revert this if the old messages are better, or if it is better to fix this in a separate commit.)

The performance seems more or less the same; I ran some benchmarks without seeing significant changes in the results. If you have better benchmarks, let me know. I used 320kB files with some plain ASCII, ASCII mixed with some accented characters, Japanese, and a file with a sample of several different Unicode chars.

----------
Added file: http://bugs.python.org/file16754/issue8271v3.diff

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue8271>
_______________________________________