Serhiy Storchaka <storch...@gmail.com> added the comment:

Yes, this is implementation-dependent behavior (and on the supported platforms it is implemented well in a certain way).
However, if the continuation byte check is done in the simplest way, ((ch) >= 0x80 && (ch) < 0xC0), it has the same effect (a speedup of up to +45%) on AMD Athlon.

                                   vanilla      patched
utf-8  'A'*10000                   2061 (-2%)   2018
utf-8  '\x80'*10000                 383 (+9%)    416
utf-8  '\x80'+'A'*9999             1273 (+3%)   1315
utf-8  '\u0100'*10000               382 (+46%)   558
utf-8  '\u0100'+'A'*9999           1239 (+0%)   1245
utf-8  '\u0100'+'\x80'*9999         383 (+46%)   558
utf-8  '\u8000'*10000               434 (-6%)    408
utf-8  '\u8000'+'A'*9999           1245 (+0%)   1245
utf-8  '\u8000'+'\x80'*9999         382 (+46%)   556
utf-8  '\u8000'+'\u0100'*9999       383 (+45%)   556
utf-8  '\U00010000'*10000           358 (+0%)    359
utf-8  '\U00010000'+'A'*9999       1171 (-0%)   1170
utf-8  '\U00010000'+'\x80'*9999     381 (+30%)   495
utf-8  '\U00010000'+'\u0100'*9999   381 (+30%)   495
utf-8  '\U00010000'+'\u8000'*9999   404 (-5%)    385

On Intel Atom the results did not change or became slightly better.

                                   vanilla      patched
utf-8  'A'*10000                    623 (+3%)    642
utf-8  '\x80'*10000                 145 (+9%)    158
utf-8  '\x80'+'A'*9999              354 (+4%)    367
utf-8  '\u0100'*10000               164 (+0%)    164
utf-8  '\u0100'+'A'*9999            343 (+2%)    351
utf-8  '\u0100'+'\x80'*9999         164 (+1%)    165
utf-8  '\u8000'*10000               175 (-2%)    171
utf-8  '\u8000'+'A'*9999            349 (+3%)    359
utf-8  '\u8000'+'\x80'*9999         164 (+0%)    164
utf-8  '\u8000'+'\u0100'*9999       164 (+0%)    164
utf-8  '\U00010000'*10000           152 (-1%)    150
utf-8  '\U00010000'+'A'*9999        313 (+2%)    319
utf-8  '\U00010000'+'\x80'*9999     161 (+1%)    162
utf-8  '\U00010000'+'\u0100'*9999   161 (+1%)    162
utf-8  '\U00010000'+'\u8000'*9999   160 (-2%)    156

----------
Added file: http://bugs.python.org/file25733/decode_utf8_range_check.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue14923>
_______________________________________
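For readers following along, the simple range check quoted above can be sketched in isolation as below. This is only an illustration, not the contents of the attached patch: the macro names and the `count_continuation_bytes` helper are hypothetical, and the mask variant is included purely for comparison. The sketch assumes the byte is held in an unsigned type; with a signed `char` the range comparison would not behave the same way.

```c
#include <assert.h>
#include <stddef.h>

/* Range check as quoted in the message: a UTF-8 continuation byte has
   the bit pattern 10xxxxxx, i.e. a value in 0x80..0xBF.
   Assumes ch is unsigned (e.g. unsigned char). Macro name is hypothetical. */
#define IS_CONTINUATION_BYTE(ch) ((ch) >= 0x80 && (ch) < 0xC0)

/* Equivalent mask form, shown only for comparison (not from the message). */
#define IS_CONTINUATION_BYTE_MASK(ch) (((ch) & 0xC0) == 0x80)

/* Hypothetical helper: count the continuation bytes in a buffer. */
static size_t
count_continuation_bytes(const unsigned char *s, size_t n)
{
    size_t count = 0;
    assert(s != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (IS_CONTINUATION_BYTE(s[i]))
            count++;
    }
    return count;
}
```

As a usage example, the UTF-8 encoding of '\u0100' is the two bytes 0xC4 0x80, so passing that buffer to `count_continuation_bytes` with a length of 2 would report one continuation byte.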