[issue31484] Cache single-character strings outside of the Latin1 range

Terry J. Reedy Fri, 15 Sep 2017 13:53:56 -0700

Terry J. Reedy added the comment:

I looked at the Gutenberg samples.  The first has a short intro with some 
English, then is pure Greek.  The patch is clearly good for anyone using mostly 
a single-block alphabetic language.


The second is Chinese, not hieroglyphs (ancient Egyptian).  A slowdown for 
ancient Egyptian is irrelevant; a slowdown for Chinese is undesirable.  
Japanese mostly uses about 2000 Chinese chars, the Chinese more.  Even if the 
common chars are grouped together (I don't know), there are at least 10 
possible chars for each 2-char slot.  So I am not surprised at a net slowdown.  
I would also not be surprised if Japanese fared worse, as it uses at least 2 
blocks for its kana and uses many Latin chars.
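
A rough way to see the collision argument is a toy simulation.  This is only 
a sketch under assumptions not stated in the patch discussion here: 256 
buckets of 2 entries each, indexed by the low 8 bits of the code point, with 
least-recently-used eviction inside a bucket.  The names (BUCKETS, WAYS, 
hit_rate) are hypothetical, and the uniform sampling is unrealistic because 
real text is heavily skewed toward common characters.

```python
import random

BUCKETS = 256  # assumed number of buckets
WAYS = 2       # assumed entries per bucket ("2 x 256 slots")


def hit_rate(code_points):
    """Fraction of lookups served by a toy 2-way cache (LRU per bucket)."""
    cache = [[] for _ in range(BUCKETS)]
    hits = 0
    for cp in code_points:
        bucket = cache[cp % BUCKETS]  # assumption: index by low 8 bits
        if cp in bucket:
            hits += 1
            bucket.remove(cp)         # re-appended below as most recent
        elif len(bucket) == WAYS:
            bucket.pop(0)             # evict the least recently used entry
        bucket.append(cp)
    return hits / len(code_points)


rng = random.Random(0)
# 25 Greek lowercase code points: each lands in its own bucket.
greek = [rng.randrange(0x3B1, 0x3CA) for _ in range(100_000)]
# 2000 contiguous CJK code points: 7 or 8 of them compete for each bucket.
cjk = [rng.randrange(0x4E00, 0x4E00 + 2000) for _ in range(100_000)]

print(f"Greek hit rate: {hit_rate(greek):.2f}")
print(f"CJK hit rate:   {hit_rate(cjk):.2f}")
```

In this toy setup the Greek stream hits almost every time, while the 
uniformly sampled CJK stream misses on most lookups.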

Unless we go beyond 2 x 256 slots, caching CJK is hopeless.  Have you 
considered limiting the caching to the blocks before the CJK blocks, up to, 
say, U+31BF (see https://en.wikipedia.org/wiki/Unicode_block)?  Both Japanese 
and Korean might then see an actual speedup.
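
The suggested cutoff could be sketched as a simple range check.  This is an 
illustration only, not code from the patch; CACHE_LIMIT and is_cacheable are 
hypothetical names.

```python
CACHE_LIMIT = 0x31BF  # cutoff suggested above (end of Bopomofo Extended)


def is_cacheable(ch):
    """Return True if a one-character string is at or below the cutoff."""
    return ord(ch) <= CACHE_LIMIT


for ch in ("α", "あ", "ア", "ㄱ", "中", "한"):
    print(f"U+{ord(ch):04X} cacheable: {is_cacheable(ch)}")
```

Greek, kana, and Hangul compatibility jamo fall below the cutoff; CJK 
ideographs such as U+4E2D do not.  Note that precomposed Hangul syllables 
(U+AC00 and up) also lie above U+31BF, so with this cutoff only the jamo 
blocks would be covered for Korean.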

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue31484>
_______________________________________
