On Wed, Jul 19, 2017 at 11:42 PM, Marko Rauhamaa <ma...@pacujo.net> wrote:
> Chris Angelico <ros...@gmail.com>:
>
>> Perhaps we don't have the same understanding of "constant time". Or
>> are you saying that you actually store and represent this as those
>> arbitrary-precision integers? Every character of every string has to
>> be a multiprecision integer?
>
> Yes, although feel free to optimize. The internal implementation isn't
> important but those "multiprecision" integers are part of an outward
> interface. So you could have:
>
> >>> for c in Text("aq̈u \U0001F64B\U0001F3FF\u200D\u2642\uFE0F"):
> ...     print(c)
> ...
> 97
> 1895826184
> 117
> 32
> 5152920508016097895476141586773579
>
> (Note, though, that Python3 only has integers, there's no
> "multiprecision" about them.)
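For concreteness: those numbers are consistent with packing each
codepoint of a combined character into a 24-bit field of one integer,
at least for the q̈ (0x71 << 24 | 0x308 == 1895826184). Here's a rough
sketch of what iterating such a Text might do. The 24-bit packing is my
guess at Marko's encoding, and I'm grouping by combining class only, so
the ZWJ emoji sequence above would NOT cluster without full UAX #29
grapheme segmentation:

```python
import unicodedata

def pack(cluster):
    # Pack a cluster's codepoints into 24-bit fields of one integer.
    # 24 bits comfortably holds any codepoint (max is U+10FFFF, 21 bits).
    n = 0
    for ch in cluster:
        n = (n << 24) | ord(ch)
    return n

def packed_chars(s):
    # Group each base character with its trailing combining marks,
    # then yield one packed integer per user-perceived character.
    cluster = []
    for ch in s:
        if cluster and not unicodedata.combining(ch):
            yield pack(cluster)
            cluster = []
        cluster.append(ch)
    if cluster:
        yield pack(cluster)

for c in packed_chars("aq\u0308u"):
    print(c)  # prints 97, then 1895826184, then 117
```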
I said "multiprecision" because that's what the low-level
arbitrary-precision-integer library calls them - GNU Multiprecision
Integer [1]. Somehow you're going to have to store those in an
indexable way, and since you can't fit arbitrary-precision data into
fixed-width slots, O(1) addressing is going to require external
storage. Basically, you're representing a string as if it were a tuple
of integers. That makes fine sense semantically, but it's pretty
costly:

>>> sys.getsizeof("hello, world")
61
>>> sys.getsizeof(tuple("hello, world"))
144

That's just for the tuple itself; then you need to represent the
actual numbers. Each one will require an addressable memory
allocation, which basically means a minimum of 8 bytes (on my 64-bit
system; you'd save a bit on a 32-bit Python, but most people don't use
those now). So every character in your string requires an 8-byte
pointer plus an 8-byte value. In contrast, current CPython requires
*at most* four bytes per character, and for many strings, requires
only one byte per character - possible only because the data is kept
as an array, not as external references.

Now, this is a performance question, and it's not unreasonable to talk
about semantics first and let performance wait for later. But when you
consider how many ASCII-only strings Python uses internally (the names
of basically every global function and every attribute in every stdlib
module), and how you'll be enlarging those by a factor of 16 *and*
making every character lookup require two pointer reads, it's pretty
much a non-starter.

There MIGHT be something to be done using a sentinel value that
represents "the actual value is somewhere else". However, I'm not sure
how you could do that cleanly in a one-byte-per-char string other than
maintaining some external table. So here's the best I can come up with
for efficiency - and it suffers horrendously from complexity:

* Strings with all codepoints < 256 are represented as they currently
  are (one byte per char). There are no combining characters in the
  first 256 codepoints anyway.
* Strings with all codepoints < 65536 and no combining characters,
  ditto (two bytes per char).
* Strings with any combining characters in them are stored in four
  bytes per char even if all codepoints are < 65536.
* Any time a character consists of a single base with no combining, it
  is stored in UTF-32.
* Combined characters are stored in the primary array as 0x80000000
  plus the index into a secondary array where these values are stored.
* The secondary array has a pointer for each combined character
  (ignoring single-code-point characters), probably to a Python
  integer object for simplicity.

This scheme allows a maximum of two billion combined characters in any
string. Worst case, "a\u0303"*0x80000000 is a four-billion-character
string that simply can't be represented; but long before that, you'll
run out of space to allocate all those large integers. (Current
CPython represents that string in 8GB of memory. Enough to push me
into the swapper - I have only 16GB in this system and a lot of it is
in use - but nothing I can't handle.)

It also has the advantage that most strings won't change in
representation. However, the complexity is insane; I don't want to be
the one to write all the unit tests to make sure everything behaves as
advertised!

Also, this system has the nasty implication that the creation of a new
combining character will fundamentally change the way a string
behaves. That means that running a slightly older version of Python
could potentially cause, not errors, but subtly different behaviour.
With Python 2.7, 3.4, 3.5, 3.6, and 3.7, I have four different major
Unicode versions, which means plenty of potential for newly-allocated
codepoints in newer Pythons.
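A toy model of the sentinel scheme, to make the two-level lookup
concrete. The names are mine, not a real API, and again I'm grouping by
combining class only and packing into 24-bit fields as a stand-in for
however the big integers would really be built:

```python
import unicodedata

COMBINED = 0x80000000  # high-bit sentinel: "value lives in the side table"

class HybridText:
    def __init__(self, s):
        self.primary = []    # would be a uint32 array in a real build
        self.secondary = []  # one Python int per combined character
        cluster = []
        for ch in s:
            if cluster and not unicodedata.combining(ch):
                self._store(cluster)
                cluster = []
            cluster.append(ch)
        if cluster:
            self._store(cluster)

    def _store(self, cluster):
        if len(cluster) == 1:
            # Single base, no combining: goes straight into the array.
            self.primary.append(ord(cluster[0]))
        else:
            # Combined character: pack codepoints into one big integer,
            # park it in the side table, store sentinel | index.
            packed = 0
            for ch in cluster:
                packed = (packed << 24) | ord(ch)
            self.primary.append(COMBINED | len(self.secondary))
            self.secondary.append(packed)

    def __len__(self):
        return len(self.primary)

    def __getitem__(self, i):
        v = self.primary[i]
        if v & COMBINED:
            return self.secondary[v & (COMBINED - 1)]  # the second lookup
        return v

t = HybridText("aq\u0308u")
print(len(t), t[1])  # prints: 3 1895826184
```

Note that indexing a combined character costs that extra indirection,
which is exactly the "two pointer reads" problem from above.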
That's not usually a problem, as it only affects a few things in the
unicodedata module:

rosuav@sikorsky:~$ python3.4 -c "import unicodedata; print(unicodedata.name('\U0001f917'))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ValueError: no such name
rosuav@sikorsky:~$ python3.5 -c "import unicodedata; print(unicodedata.name('\U0001f917'))"
HUGGING FACE
rosuav@sikorsky:~$ python3.6 -c "import unicodedata; print(unicodedata.name('\u1df6'))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ValueError: no such name
rosuav@sikorsky:~$ python3.7 -c "import unicodedata; print(unicodedata.name('\u1df6'))"
COMBINING KAVYKA ABOVE RIGHT

But if combining characters behave fundamentally differently to
others, there would be a change in string representation when U+1DF6
became a combining character. That's going to cause MASSIVE upheaval.
I don't think there's any solution to that, but if you can find one,
do please elaborate.

ChrisA

[1] https://gmplib.org/
-- 
https://mail.python.org/mailman/listinfo/python-list