On Tue, Jul 17, 2018 at 5:40 AM, Marko Rauhamaa <ma...@pacujo.net> wrote:
> Terry Reedy <tjre...@udel.edu>:
>
>> On 7/15/2018 5:28 PM, Marko Rauhamaa wrote:
>>> if your new system used Python3's UTF-32 strings as a foundation,
>>
>> Since 3.3, Python's strings are not (always) UTF-32 strings.
>
> You are right. Python's strings are a superset of UTF-32. More
> accurately, Python's strings are UTF-32 plus surrogate characters.
>
>> Nor are they always UCS-2 (or partly UTF-16) strings. Nor are they
>> always Latin-1 or ASCII strings. Python's Flexible String
>> Representation uses the narrowest possible internal code for any
>> particular string. This is all transparent to the user except for
>> memory size.
>
> How CPython chooses to represent its strings internally is not what I'm
> talking about.

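(An aside on the "transparent except for memory size" remark quoted above: the following is a minimal sketch, assuming CPython 3.3 or later. The sample strings and the `samples` name are made up for illustration, and the exact byte counts vary by version and platform, so none are stated here.)

```python
import sys

# Three strings of equal length; only the widest code point differs.
samples = {
    "Latin-1 range": "a" * 100,
    "BMP beyond Latin-1": "\u0101" * 100,
    "beyond the BMP": "\U0001F600" * 100,
}

for label, text in samples.items():
    # len() counts code points, whatever the internal storage width is.
    print(f"{label:20} len={len(text)} bytes={sys.getsizeof(text)}")

# On CPython the footprint grows with the widest code point in the string.
sizes = [sys.getsizeof(text) for text in samples.values()]
```

All three strings report the same length; only the memory footprint should differ.
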
Then don't talk about UTF-32, which is a representation format.

>>> UTF-32, after all, is a variable-width encoding.
>>
>> Nope. It's a fixed-width (32 bits, 4 bytes) encoding.
>>
>> Perhaps you should ask more questions before pontificating.
>
> You mean each code point is one code point wide. But that's rather an
> irrelevant thing to state. The main point is that UTF-32 (aka Unicode)
> uses one or more code points to represent what people would consider an
> individual character.

No, each code point is one code unit wide. It's not irrelevant.

> The letter "a" is encoded as a single code point, but 🇬🇧 (Flag, United
> Kingdom) is two code points wide and 🏴󠁧󠁢󠁥󠁮󠁧󠁿 (Flag, England) is seven (!)
> code points wide, not to forget 🧖‍♂️ (Man in Steamy Room) with four code
> points. <URL: https://unicode.org/emoji/charts/full-emoji-list.html>
>
> And of course, regular West-European letters can be represented by
> multiple code points.
>
> Code points are about as interesting as individual bytes in UTF-8.

Individual bytes in UTF-8 do not have individual meaning. Individual
code points do, with the partial exception of the flag characters
(which are pretty poorly supported anyway). Otherwise, every code
point is either a base character with general meaning, or a combining
character (or variation selector) that represents a specific change.
They can be composed in different ways. For example:

U+006F U+0301 "ó" LATIN SMALL LETTER O WITH ACUTE
U+006F U+030B "ő" LATIN SMALL LETTER O WITH DOUBLE ACUTE
U+0075 U+0301 "ú" LATIN SMALL LETTER U WITH ACUTE
U+0075 U+030B "ű" LATIN SMALL LETTER U WITH DOUBLE ACUTE

The UTF-8 representations of the combined forms of these characters are:

C3 B3
C5 91
C3 BA
C5 B1

What does byte value C5 mean? What does 91 mean? None of these has
meaning on its own. The only way you can interpret them is as a full
set. In contrast, the individual code points have meaning: each one
is either a base character or a combining character.

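The contrast can be checked from the interpreter. This is only a sketch using the standard `unicodedata` module; the helper names `describe` and `decodes_alone` are made up for illustration.

```python
import unicodedata

decomposed = "o\u030B"                               # U+006F U+030B
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+0151


def describe(text):
    """Return the Unicode name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]


def decodes_alone(byte_value):
    """Report whether a single byte is valid UTF-8 on its own."""
    try:
        bytes([byte_value]).decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Each code point carries its own name: a base letter and a combining mark.
names = describe(decomposed)
print(names)

# UTF-32 is fixed width: exactly four bytes per code point.
utf32 = decomposed.encode("utf-32-le")
print(len(utf32), "bytes for", len(decomposed), "code points")

# The bytes of the composed form only mean something as a pair.
utf8 = composed.encode("utf-8")
print(utf8.hex(" ").upper())
print(decodes_alone(0xC5), decodes_alone(0x91))
```

The two code points each come back with a name, while neither C5 nor 91 decodes by itself.
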
So, no, individual code points are very interesting.

ChrisA