New submission from Serhiy Storchaka: Single-character strings in the Latin1 range (U+0000 - U+00FF) are shared in CPython. This saves memory and CPU time of per-character processing of strings containing ASCII characters and characters from Latin based alphabets. But the users of languages that use non-Latin based alphabets are not so lucky. Proposed PR adds a cache for characters in BMP (U+0100 - U+FFFF) which covers most alphabetic scripts.
Most alphabets contain characters from a single 128- or 256-character block, therefore only lowest bits are used for addressing in the cache. But together with the characters from a particular alphabet it is common to use ASCII characters (spaces, newline, punctuations, digits, etc) and few Unicode punctuation (long dash, Unicode quotes, etc). Their low bits can match low bits of letters. Therefore every index addresses not a single character, but a mini-LRU-cache of size 2. This keeps letters in a cache even if non-letters with conflicting low bits are occurred. Microbenchmarking results. Iterating sample non-Latin-based alphabetic text (Iliad by Homer [1]) is over 70% faster: $ ./python -m timeit -s 's = open("36248-0.txt").read()' -- 'for c in s: pass' Unpatched: 20 loops, best of 5: 14.5 msec per loop Patched: 50 loops, best of 5: 8.32 msec per loop Iterating sample hieroglyphic text (Shui Hu Zhuan by Nai an Shi [2]) is about 4% slower: $ ./python -m timeit -s 's = open("23863-0.txt").read()' -- 'for c in s: pass' Unpatched: 20 loops, best of 5: 11.7 msec per loop Patched: 20 loops, best of 5: 12.1 msec per loop Iterating a string containing non-repeated characters from the all BMP range is 20% slower: $ ./python -m timeit -s 's = "".join(map(chr, range(0x10000)))' -- 'for c in s: pass' Unpatched: 200 loops, best of 5: 1.39 msec per loop Patched: 200 loops, best of 5: 1.7 msec per loop [1] https://www.gutenberg.org/files/36248/36248-0.txt [2] https://www.gutenberg.org/files/23863/23863-0.txt ---------- components: Interpreter Core, Unicode messages: 302253 nosy: benjamin.peterson, ezio.melotti, haypo, lemburg, serhiy.storchaka priority: normal severity: normal stage: patch review status: open title: Cache single-character strings outside of the Latin1 range type: performance versions: Python 3.7 _______________________________________ Python tracker <rep...@bugs.python.org> <https://bugs.python.org/issue31484> _______________________________________ _______________________________________________ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com