[issue14923] Even faster UTF-8 decoding

Serhiy Storchaka Sun, 27 May 2012 08:49:55 -0700

Serhiy Storchaka <[email protected]> added the comment:

Yes, this behavior is implementation-dependent (although on the supported 
platforms it is implemented in one particular, well-defined way).


However, doing the continuation byte check the simplest way, ((ch) >= 0x80 && 
(ch) < 0xC0), has the same effect on AMD Athlon (a speedup of up to +45%).
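
As an illustration only (this is not the code from the patch), the range check
quoted above can be compared with the usual bit-mask test for a UTF-8
continuation byte. The function names below are hypothetical, and the sketch is
in Python rather than C, so it shows the logic and says nothing about speed:

```python
def is_continuation_mask(ch: int) -> bool:
    """Bit-mask test: a continuation byte has the form 10xxxxxx."""
    return (ch & 0xC0) == 0x80


def is_continuation_range(ch: int) -> bool:
    """Range test, mirroring the C expression (ch) >= 0x80 && (ch) < 0xC0."""
    return 0x80 <= ch < 0xC0


# Both predicates accept exactly the same byte values.
assert all(is_continuation_mask(b) == is_continuation_range(b)
           for b in range(256))

# '\u0100' encodes to a lead byte (0xC4) followed by one continuation byte (0x80).
encoded = '\u0100'.encode('utf-8')
print([is_continuation_range(b) for b in encoded])  # [False, True]
```

Since both forms accept the same 64 byte values, choosing between them is a
matter of which one the compiler and CPU handle faster.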

                                          vanilla      patched

utf-8     'A'*10000                       2061 (-2%)   2018
utf-8     '\x80'*10000                    383 (+9%)    416
utf-8       '\x80'+'A'*9999               1273 (+3%)   1315
utf-8     '\u0100'*10000                  382 (+46%)   558
utf-8       '\u0100'+'A'*9999             1239 (+0%)   1245
utf-8       '\u0100'+'\x80'*9999          383 (+46%)   558
utf-8     '\u8000'*10000                  434 (-6%)    408
utf-8       '\u8000'+'A'*9999             1245 (+0%)   1245
utf-8       '\u8000'+'\x80'*9999          382 (+46%)   556
utf-8       '\u8000'+'\u0100'*9999        383 (+45%)   556
utf-8     '\U00010000'*10000              358 (+0%)    359
utf-8       '\U00010000'+'A'*9999         1171 (-0%)   1170
utf-8       '\U00010000'+'\x80'*9999      381 (+30%)   495
utf-8       '\U00010000'+'\u0100'*9999    381 (+30%)   495
utf-8       '\U00010000'+'\u8000'*9999    404 (-5%)    385

On Intel Atom the results either did not change or became slightly better.

                                          vanilla      patched

utf-8     'A'*10000                       623 (+3%)    642
utf-8     '\x80'*10000                    145 (+9%)    158
utf-8       '\x80'+'A'*9999               354 (+4%)    367
utf-8     '\u0100'*10000                  164 (+0%)    164
utf-8       '\u0100'+'A'*9999             343 (+2%)    351
utf-8       '\u0100'+'\x80'*9999          164 (+1%)    165
utf-8     '\u8000'*10000                  175 (-2%)    171
utf-8       '\u8000'+'A'*9999             349 (+3%)    359
utf-8       '\u8000'+'\x80'*9999          164 (+0%)    164
utf-8       '\u8000'+'\u0100'*9999        164 (+0%)    164
utf-8     '\U00010000'*10000              152 (-1%)    150
utf-8       '\U00010000'+'A'*9999         313 (+2%)    319
utf-8       '\U00010000'+'\x80'*9999      161 (+1%)    162
utf-8       '\U00010000'+'\u0100'*9999    161 (+1%)    162
utf-8       '\U00010000'+'\u8000'*9999    160 (-2%)    156
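
Measurements of this kind can be approximated with the standard timeit module.
The following is a minimal sketch, not the script that produced the tables
above: the helper name decode_rate, the repeat counts, and the choice of MB/s
as the unit are all assumptions.

```python
import timeit


def decode_rate(text: str, repeat: int = 5, number: int = 100) -> float:
    """Return the best observed UTF-8 decoding rate in MB/s (assumed unit)."""
    data = text.encode('utf-8')
    # Take the fastest of several runs to reduce noise from other processes.
    best = min(timeit.repeat(lambda: data.decode('utf-8'),
                             repeat=repeat, number=number))
    return len(data) * number / best / 1e6


# A few of the test strings from the tables above.
cases = {
    "'A'*10000": 'A' * 10000,
    "'\\x80'*10000": '\x80' * 10000,
    "'\\u0100'*10000": '\u0100' * 10000,
    "'\\u8000'+'A'*9999": '\u8000' + 'A' * 9999,
}

for label, text in cases.items():
    print(f"utf-8  {label:<24} {decode_rate(text):8.0f}")
```

Running it once on an unpatched build and once on a patched build would give
the two columns to compare; absolute numbers will differ from machine to
machine.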

----------
Added file: http://bugs.python.org/file25733/decode_utf8_range_check.patch

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue14923>
_______________________________________