Steve D'Aprano <steve+pyt...@pearwood.info>:

> On Sat, 15 Jul 2017 04:10 am, Marko Rauhamaa wrote:
>> Python3's strings don't give me any better random access than UTF-8.
>
> Say what? Of course they do.
>
> Python 3 strings (since 3.3) are a compact form of UTF-32. Without loss of
> generality, we can say that each string is an array of four-byte code units.

Yes, and a UTF-8 byte array gives me random access to the UTF-8
single-byte code units. Neither gives me random access to the "Grapheme
clusters, a.k.a. real characters".

For example, the HFS+ file system uses a variant of NFD for filenames,
meaning both UTF-32 and UTF-8 give you random access to pure ASCII
filenames only.

> UTF-8 is not: it is a variable-width encoding,

UTF-32 is a variable-width encoding as well. For example, "baby: medium
skin tone" is U+1F476 U+1F3FD:

   <URL: http://unicode.org/emoji/charts/full-emoji-list.html#1f476_1f3fd>

> Go ignores this problem by simply not offering random access to code
> points in strings.

Random access to code points is as uninteresting as random access to
UTF-8 bytes. I might want random access to the "Grapheme clusters,
a.k.a. real characters". As you have pointed out, that wish is impossible
to grant unambiguously.


Marko
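
P.S. To make the point concrete, here is a minimal Python 3 sketch using only the standard library. The sample strings ("café" and the emoji pair) and the variable names are my own illustration. Plain NFD stands in for the HFS+ variant, which differs in its details.

```python
import unicodedata

# "é" decomposed as an NFD-style filename would store it:
# U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT.
name = unicodedata.normalize("NFD", "caf\u00e9")

# "baby: medium skin tone" is two code points: U+1F476 U+1F3FD.
baby = "\U0001F476\U0001F3FD"

print(len(name))                       # 5 code points for 4 visible characters
print(name[3] == "e")                  # True: index 3 is the bare base letter
print(unicodedata.combining(name[4]))  # nonzero: index 4 is a combining mark
print(len(baby))                       # 2 code points for one visible emoji
print(len(baby.encode("utf-8")))       # 8 UTF-8 code units for the same emoji
```

Indexing by code point lands in the middle of what a reader sees as one character, just as indexing by byte does in UTF-8. Getting at the clusters themselves would take a segmentation pass, which the standard library does not provide as far as I know.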