New submission from Serhiy Storchaka:

Single-character strings in the Latin1 range (U+0000 - U+00FF) are shared in 
CPython. This saves memory and CPU time in per-character processing of strings 
containing ASCII characters and characters from Latin-based alphabets. But 
users of languages with non-Latin-based alphabets are not so lucky. 
The proposed PR adds a cache for characters in the BMP (U+0100 - U+FFFF), which 
covers most alphabetic scripts.
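
The existing sharing can be observed from Python code. The following is a 
minimal sketch, not part of the PR; the helper name is_shared is made up for 
illustration, and object identity of strings is a CPython implementation 
detail, so the result for characters outside Latin1 depends on the build.

```python
import sys


def is_shared(code_point):
    """Return True if two separately created one-character strings
    for the same code point are the same object.

    Hypothetical helper for illustration only. Identity of strings is
    an implementation detail and must not be relied on in real code.
    """
    first = chr(code_point)
    second = chr(code_point)
    return first is second


# U+0041 is in the Latin1 range; U+0414 (Cyrillic) is in the BMP outside it.
for cp in (0x41, 0x0414):
    print(f"U+{cp:04X} shared on {sys.implementation.name}: {is_shared(cp)}")
```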

Most alphabets contain characters from a single 128- or 256-character block, 
so only the lowest bits of the code point are used for addressing in the cache. 
But together with the characters of a particular alphabet it is common to use 
ASCII characters (spaces, newlines, punctuation, digits, etc.) and a few Unicode 
punctuation characters (long dash, Unicode quotes, etc.). Their low bits can 
match the low bits of letters. Therefore every index addresses not a single 
character but a mini-LRU-cache of size 2. This keeps letters in the cache even 
when non-letters with conflicting low bits occur.
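
The addressing scheme can be modelled in a few lines of Python. This is only a 
toy sketch of the idea described above, not the C code from the PR: the class 
name TwoWayCharCache, the choice of 8 index bits, and the hit/miss counters are 
all assumptions made for illustration.

```python
class TwoWayCharCache:
    """Toy model: low bits of the code point select a bucket, and each
    bucket is a tiny LRU list holding at most two entries."""

    def __init__(self, index_bits=8):  # 8 bits is an assumed table size
        self.mask = (1 << index_bits) - 1
        self.buckets = [[] for _ in range(1 << index_bits)]
        self.hits = 0
        self.misses = 0

    def get(self, code_point):
        bucket = self.buckets[code_point & self.mask]
        for i, (cached_cp, cached_str) in enumerate(bucket):
            if cached_cp == code_point:
                self.hits += 1
                if i == 1:
                    # Move the entry to the front: it is now most recent.
                    bucket[0], bucket[1] = bucket[1], bucket[0]
                return cached_str
        # Miss: create the string, put it in front, drop the oldest entry.
        self.misses += 1
        new_str = chr(code_point)
        bucket.insert(0, (code_point, new_str))
        del bucket[2:]
        return new_str


cache = TwoWayCharCache()
letter = cache.get(0x0414)        # Cyrillic letter, low byte 0x14
cache.get(0x2014)                 # long dash, same low byte 0x14
same_letter = cache.get(0x0414)   # still cached despite the conflict
print(letter is same_letter, cache.hits, cache.misses)
```

With a single slot per index, the long dash would have evicted the letter and 
the third lookup would have been a miss; the second slot is what keeps the 
letter alive.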

Microbenchmarking results.

Iterating over a sample of non-Latin-based alphabetic text (the Iliad by 
Homer [1]) is over 70% faster:

$ ./python -m timeit -s 's = open("36248-0.txt").read()' -- 'for c in s: pass'
Unpatched:  20 loops, best of 5: 14.5 msec per loop
Patched:    50 loops, best of 5: 8.32 msec per loop

Iterating over a sample of logographic text (Shui Hu Zhuan by Nai an Shi [2]) 
is about 4% slower:

$ ./python -m timeit -s 's = open("23863-0.txt").read()' -- 'for c in s: pass'
Unpatched:  20 loops, best of 5: 11.7 msec per loop
Patched:    20 loops, best of 5: 12.1 msec per loop

Iterating over a string of non-repeated characters covering the whole BMP range 
is about 20% slower:

$ ./python -m timeit -s 's = "".join(map(chr, range(0x10000)))' -- 'for c in s: pass'
Unpatched:  200 loops, best of 5: 1.39 msec per loop
Patched:    200 loops, best of 5: 1.7 msec per loop
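
For reference, the ratios behind the percentages above can be recomputed from 
the reported timings. The helper name speedup is hypothetical; the numbers are 
the ones quoted in this message.

```python
def speedup(unpatched_ms, patched_ms):
    """Ratio of unpatched to patched time; above 1.0 means patched is faster."""
    return unpatched_ms / patched_ms


# (unpatched, patched) timings in msec per loop, as reported above.
timings = {
    "Iliad": (14.5, 8.32),
    "Shui Hu Zhuan": (11.7, 12.1),
    "whole BMP": (1.39, 1.7),
}

for name, (before, after) in timings.items():
    print(f"{name}: {speedup(before, after):.2f}x")
```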


[1] https://www.gutenberg.org/files/36248/36248-0.txt
[2] https://www.gutenberg.org/files/23863/23863-0.txt

----------
components: Interpreter Core, Unicode
messages: 302253
nosy: benjamin.peterson, ezio.melotti, haypo, lemburg, serhiy.storchaka
priority: normal
severity: normal
stage: patch review
status: open
title: Cache single-character strings outside of the Latin1 range
type: performance
versions: Python 3.7

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue31484>
_______________________________________