Re: Grapheme clusters, a.k.a. real characters

Gregory Ewing Wed, 19 Jul 2017 15:18:21 -0700

Chris Angelico wrote:

* Strings with all codepoints < 256 are represented as they currently
are (one byte per char). There are no combining characters in the
first 256 codepoints anyway.
* Strings with all codepoints < 65536 and no combining characters,
ditto (two bytes per char).
* Strings with any combining characters in them are stored in four
bytes per char even if all codepoints are < 65536.
* Any time a character consists of a single base with no combining
characters, it is stored in UTF-32.
* Combined characters are stored in the primary array as 0x80000000
plus the index into a secondary array where these values are stored.
* The secondary array has a pointer for each combined character
(ignoring single-code-point characters), probably to a Python integer
object for simplicity.
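
For illustration, the layout described above might be sketched in Python
roughly as follows. This is only a toy model under stated assumptions, not
how CPython stores strings: the names encode, decode and COMBINED_FLAG are
made up for the example, the secondary array holds tuples of code points
rather than Python integer objects, and "combining" is approximated by
unicodedata.combining() being nonzero, which is narrower than full
grapheme cluster segmentation.

```python
import unicodedata

COMBINED_FLAG = 0x80000000  # marker value taken from the proposal


def encode(text):
    """Return (bytes_per_char, primary, secondary) for text (hypothetical helper)."""
    # Group each base code point with the combining marks that follow it.
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1].append(ord(ch))
        else:
            clusters.append([ord(ch)])

    primary, secondary = [], []
    for cluster in clusters:
        if len(cluster) == 1:
            # Single code point: stored directly in the primary array.
            primary.append(cluster[0])
        else:
            # Combined character: flag plus index into the secondary array.
            primary.append(COMBINED_FLAG + len(secondary))
            secondary.append(tuple(cluster))  # assumption: tuple, not an int

    # Pick the per-character width according to the rules in the list.
    highest = max(map(ord, text), default=0)
    if any(unicodedata.combining(ch) for ch in text):
        width = 4
    elif highest < 256:
        width = 1
    elif highest < 65536:
        width = 2
    else:
        width = 4
    return width, primary, secondary


def decode(primary, secondary):
    """Rebuild the original str from the two arrays."""
    parts = []
    for unit in primary:
        if unit >= COMBINED_FLAG:
            parts.append("".join(map(chr, secondary[unit - COMBINED_FLAG])))
        else:
            parts.append(chr(unit))
    return "".join(parts)
```

With this sketch, encode("e\u0301x") yields two primary entries (one
flagged index and one plain code point), so indexing and length would
count the accented letter as a single character.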


+1. We should totally do this just to troll the RUE!

--
Greg
--
https://mail.python.org/mailman/listinfo/python-list
