On Sun, Mar 9, 2014 at 2:01 PM, Roy Smith <r...@panix.com> wrote:
> In article <531bd709$0$29985$c3e8da3$54964...@news.astraweb.com>,
> Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
>
>> There are various common ways to store Unicode strings in RAM.
>>
>> The first, UTF-16.
>> [...]
>> Another option is UTF-32.
>> [...]
>> Another option is to use UTF-8 internally.
>> [...]
>> In Python 3.3, CPython introduced an internal scheme that gives the best
>> of all worlds. When a string is created, Python uses a different
>> implementation depending on the characters in the string:
>
> This was an excellent post, but I would take exception to the "best of
> all worlds" statement. I would put it a little less absolutely and say
> something like, "a good compromise for many common use cases". I would
> even go with, "... for most common use cases". But, there are
> situations where it loses.

It's universally good for string indexing/slicing on binary CPUs (there's
no point using a 24-bit or 21-bit representation on an Intel-compatible
CPU, even though they'd be just as good as UTF-32). It's not so much a
compromise as a recognition that Python offers convenient operators for
indexing and slicing.

If, on the other hand, Python fundamentally worked with U+0020 separated
words (REXX has a whole set of word-based functions), then it might be
better to represent strings as lists of words internally. Or if the
string operations were primarily based on the transitions between Unicode
types of "space" and "non-space", which would be more likely these days,
then something of that sort would still work.

Anyway, it's based on the operations the language makes convenient, and
which will therefore be common and expected to be fast: those are the
operations to optimize for. If the only thing you ever do with a string
is iterate sequentially over its characters, UTF-8 would be the perfect
representation. It's compact, you can concatenate strings without
re-encoding, and it iterates forwards easily. But it sucks for "give me
character #142857 from this string", so it's a bad choice for Python.

ChrisA
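
To illustrate the indexing point above, here is a rough sketch in Python. The helper names (bytes_per_char, utf8_char_at, _next_start) are made up for this example and are not part of any library. The sizes reported by sys.getsizeof are a CPython implementation detail (3.3 and later), so other implementations may well behave differently.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Estimate storage per code point by differencing two string sizes.

    Subtracting cancels the fixed object header. The result reflects
    CPython's internal representation, not a language guarantee.
    """
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) // n


def _next_start(data: bytes, pos: int) -> int:
    """Return the byte offset of the code point following the one at pos."""
    if pos >= len(data):
        raise IndexError("index out of range")
    lead = data[pos]
    if lead < 0x80:             # 0xxxxxxx: one-byte sequence (ASCII)
        width = 1
    elif lead >> 5 == 0b110:    # 110xxxxx: two-byte sequence
        width = 2
    elif lead >> 4 == 0b1110:   # 1110xxxx: three-byte sequence
        width = 3
    elif lead >> 3 == 0b11110:  # 11110xxx: four-byte sequence
        width = 4
    else:
        raise ValueError("not positioned on a UTF-8 lead byte")
    return pos + width


def utf8_char_at(data: bytes, index: int) -> str:
    """Return code point number `index` from UTF-8 bytes by scanning.

    Without a side index there is no way to jump to the right byte
    offset, so this walks from the start: O(index) rather than O(1).
    """
    if index < 0:
        raise IndexError("negative index not supported in this sketch")
    pos = 0
    for _ in range(index):
        pos = _next_start(data, pos)
    end = _next_start(data, pos)
    return data[pos:end].decode("utf-8")


if __name__ == "__main__":
    # Storage per code point for an ASCII, a BMP, and an astral character.
    for ch in ("a", "\u0394", "\U0001F600"):
        print(hex(ord(ch)), bytes_per_char(ch))

    text = "a\u0394\u20ac\U0001F600z"
    # The scan and the built-in index agree; only the cost differs.
    print(utf8_char_at(text.encode("utf-8"), 3) == text[3])
```

On a CPython 3.3+ build I would expect the three widths to come out as 1, 2 and 4 bytes, which is what lets s[i] be a fixed-offset lookup, whereas the UTF-8 helper has to step over every earlier character first.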