[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

Ezio Melotti Mon, 15 Aug 2011 10:15:44 -0700

Ezio Melotti <ezio.melo...@gmail.com> added the comment:

Here are some benchmarks:
Commands:
# half of the bytes are invalid
./python -m timeit -s 'b = bytes(range(256)); b_dec = b.decode' 'b_dec("utf-8", "surrogateescape")'
./python -m timeit -s 'b = bytes(range(256)); b_dec = b.decode' 'b_dec("utf-8", "replace")'
./python -m timeit -s 'b = bytes(range(256)); b_dec = b.decode' 'b_dec("utf-8", "ignore")'
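
For reference, the input of this first benchmark can be inspected directly. The sketch below is only an illustration and not part of the patch; it assumes a current CPython 3, where each of the bytes 0x80-0xFF in this input is reported as a separate error.

```python
# Input from the first benchmark: every byte value exactly once.
b = bytes(range(256))

# Bytes 0x00-0x7F are valid ASCII. In this input the bytes 0x80-0xFF never
# form a valid UTF-8 sequence, so half of the input is invalid.
ignored = b.decode("utf-8", "ignore")            # invalid bytes are dropped
replaced = b.decode("utf-8", "replace")          # invalid bytes become U+FFFD
escaped = b.decode("utf-8", "surrogateescape")   # invalid bytes become lone surrogates

print(len(ignored), replaced.count("\ufffd"), len(escaped))

# surrogateescape is lossless: encoding with the same handler restores the bytes.
assert escaped.encode("utf-8", "surrogateescape") == b
```

The three handlers differ in how much work they do per invalid byte, which is probably why their timings differ even on identical input.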


With patch:
1000 loops, best of 3: 854 usec per loop
1000 loops, best of 3: 509 usec per loop
1000 loops, best of 3: 415 usec per loop

Without patch:
1000 loops, best of 3: 670 usec per loop
1000 loops, best of 3: 470 usec per loop
1000 loops, best of 3: 382 usec per loop

Commands (from the interactive interpreter):
# all valid codepoints
import timeit
b = "".join(chr(c) for c in range(0x110000) if c not in range(0xD800, 0xE000)).encode("utf-8")
b_dec = b.decode
timeit.Timer('b_dec("utf-8")', 'from __main__ import b_dec').timeit(100)/100
timeit.Timer('b_dec("utf-8", "surrogateescape")', 'from __main__ import b_dec').timeit(100)/100
timeit.Timer('b_dec("utf-8", "replace")', 'from __main__ import b_dec').timeit(100)/100
timeit.Timer('b_dec("utf-8", "ignore")', 'from __main__ import b_dec').timeit(100)/100

With patch:
0.03830226898193359
0.03849360942840576
0.03835036039352417
0.03821949005126953

Without patch:
0.03750091791152954
0.037977190017700196
0.04067679166793823
0.038579678535461424

Commands:
# near-worst case scenario, 1 byte dropped every 5 from a valid utf-8 string
b2 = bytes(c for k,c in enumerate(b) if k%5)
b2_dec = b2.decode
timeit.Timer('b2_dec("utf-8", "surrogateescape")', 'from __main__ import b2_dec').timeit(10)/10
timeit.Timer('b2_dec("utf-8", "replace")', 'from __main__ import b2_dec').timeit(10)/10
timeit.Timer('b2_dec("utf-8", "ignore")', 'from __main__ import b2_dec').timeit(10)/10
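
To make the construction of b2 easier to follow, here is the same byte-dropping expression applied to a much smaller input. This is a sketch for illustration only; the sample string `text` is an assumption and is not the string used in the benchmark.

```python
# Small stand-in for the benchmark string: 2-, 3- and 4-byte sequences.
text = "\u00e9\u20ac\U0001f600" * 4
b = text.encode("utf-8")

# Same expression as in the benchmark: keep a byte only if its index is not
# a multiple of 5, i.e. drop the bytes at indices 0, 5, 10, ...
b2 = bytes(c for k, c in enumerate(b) if k % 5)

# The damaged string is no longer valid UTF-8, so strict decoding fails.
try:
    b2.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A non-strict handler still returns a result, with U+FFFD marking the damage.
repaired = b2.decode("utf-8", "replace")
print(len(b), len(b2), strict_ok, repaired.count("\ufffd"))
```

Dropping one byte in five breaks a large share of the multi-byte sequences, so the error handler is invoked very often; that is what makes this input close to a worst case.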

With patch:
9.645482301712036
6.602735090255737
5.338080596923828

Without patch:
8.124328684806823
5.804249691963196
4.851014900207519

All tests were done on a wide build of 3.2.

Since the changes only affect error handling, decoding of valid UTF-8 strings
is not affected.  Decoding invalid strings with non-strict error handlers is
slower, but I don't think the difference is significant.
If the patch is fine I will commit it.
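
The relative slowdown can be worked out from the numbers reported in this message. The sketch below only redoes that arithmetic; the helper name `slowdown_percent` and the dictionary names are hypothetical and chosen for illustration.

```python
# (with patch, without patch) timings copied from the results in this message.
half_invalid_usec = {
    "surrogateescape": (854, 670),
    "replace": (509, 470),
    "ignore": (415, 382),
}
near_worst_sec = {
    "surrogateescape": (9.645482301712036, 8.124328684806823),
    "replace": (6.602735090255737, 5.804249691963196),
    "ignore": (5.338080596923828, 4.851014900207519),
}


def slowdown_percent(with_patch, without_patch):
    """Return how much slower the patched run is, in percent."""
    return (with_patch - without_patch) / without_patch * 100.0


for label, results in (("half invalid", half_invalid_usec),
                       ("near-worst case", near_worst_sec)):
    for handler, (patched, unpatched) in results.items():
        print("%-16s %-16s %5.1f%%" % (label, handler,
                                       slowdown_percent(patched, unpatched)))
```

These are single runs on one machine, so the percentages are only a rough indication.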

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue8271>
_______________________________________