On Mon, 16 Jul 2018 13:11:23 -0400, Richard Damon wrote:

>> On Jul 16, 2018, at 12:51 PM, Steven D'Aprano
>> <steve+comp.lang.pyt...@pearwood.info> wrote:
>>
>>> On Mon, 16 Jul 2018 00:28:39 +0300, Marko Rauhamaa wrote:
>>>
>>> if your new system used Python3's UTF-32 strings as a foundation, that
>>> would be an equally naïve misstep. You'd need to reach a notch higher
>>> and use glyphs or other "semiotic atoms" as building blocks. UTF-32,
>>> after all, is a variable-width encoding.
>>
>> Python's strings aren't UTF-32. They are sequences of abstract code
>> points.
>>
>> UTF-32 is not a variable-width encoding.
>>
>> --
>> Steven D'Aprano
>>
> Many consider that UTF-32 is a variable-width encoding because of the
> combining characters. It can take multiple ‘codepoints’ to define what
> should be a single ‘character’ for display.

Ah, well if we're going to start making up our own definitions of terms,
then ASCII is a variable-width encoding too.

"Ch" (a single letter of the alphabet in a number of European languages,
including Welsh and Czech) requires two code points in ASCII. Even in
English, "qu" could be considered a two-byte "character" (grapheme), and
for ASCII users, (c) is a THREE code point sequence standing in for what
ought to be a single character, ©.

The standard definition of variable- and fixed-width encodings refers to
how many *code units* are required to make up a single *code point*. Under
that standard definition, UTF-8 and UTF-16 are variable-width, and UTF-32
is fixed-width.

But I'll accept that UTF-32 is variable-width if Marko accepts that ASCII
is too.


-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson
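
P.S. For anyone who wants to check the standard definition at the interactive prompt, here is a minimal Python 3 sketch. It counts how many code units each encoding needs for a single code point, then shows that a combining sequence is a separate matter (several code points, each still one UTF-32 code unit). The helper name code_units and the sample characters are my own choices for illustration, nothing official.

```python
import unicodedata

# Size in bytes of one code unit for each encoding form.
# The "-le" variants are used so that no byte order mark is prepended.
UNIT_SIZE = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}


def code_units(ch, encoding):
    """Return how many code units `encoding` needs for the code point `ch`.

    Hypothetical helper for illustration only.
    """
    if len(ch) != 1:
        raise ValueError("expected exactly one code point")
    return len(ch.encode(encoding)) // UNIT_SIZE[encoding]


for ch in ("A", "\u00a9", "\u20ac", "\U0001F600"):
    counts = {enc: code_units(ch, enc) for enc in UNIT_SIZE}
    print(f"U+{ord(ch):04X}", counts)

# A combining sequence: "e" followed by U+0301 COMBINING ACUTE ACCENT.
# It displays as one grapheme but is two code points in any encoding.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))
print([code_units(c, "utf-32-le") for c in decomposed])
```

The UTF-8 count varies from 1 to 4 and the UTF-16 count from 1 to 2 across those four code points, while the UTF-32 count is 1 every time. The decomposed "é" has length 2 and its NFC form has length 1, but that is a property of the text, not of the encoding.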