On Fri, Nov 21, 2014 at 12:31 PM, <random...@fastmail.us> wrote:
> On Thu, Nov 20, 2014, at 20:10, Chris Angelico wrote:
>> 2) Languages which use a different alphabet (eg Cyrillic - Russian,
>> Bulgarian). You could possibly cram them into an eight-bit encoding
>> without tipping ASCII out, but I'm not sure. In Unicode, these
>> languages are all easily supported by the BMP, as they don't use a
>> huge number of characters each.
>
> There are numerous eight-bit encodings that support latin and one other
> alphabet. Remember, ASCII is a seven-bit encoding, and an eight-bit
> encoding is basically two seven-bit encodings.

I'm aware of this; Greek, for instance, fits quite happily into
ISO-8859-7, which is eight-bit.

> The most difficult (of those still possible at all) language to encode
> in eight bits is actually Vietnamese, which uses the Latin alphabet, due
> to the sheer number of accented letters used. Windows' encoding of it
> (along with some other lesser used encodings, all for Vietnamese) is the
> only 8-bit encoding to use combining accents, in a way unfortunately
> incompatible with unicode normalization if naively translated, whereas
> VISCII sacrifices a handful of C0 control characters in addition to
> fully packing the high half with letters.

This is what I was suspicious of. The very notion of "combining
accents" already breaks the notion that "a byte is a character is a
glyph", which most eight-bit encodings try to pretend. In any case,
the BMP still easily copes with them all.

(Hmm. I wonder how you'd typeset the old "Self-Pronouncing Alphabet"
for English? It's basically English text with a few markings added to
letters - not standard diacriticals that already exist in Unicode, but
dots. Probably possible, one way or another... but I haven't seen SPA
text since the 90s, and that was in stuff published back in the 80s or
so.)

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
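
For anyone who wants to poke at the two cases discussed above, here is a minimal sketch in Python 3. It assumes the standard library's `iso-8859-7` and `cp1258` codecs behave as plain byte-to-code-point tables (the sample strings and variable names are my own, purely for illustration). It checks that Greek really is one byte per letter, and that a Vietnamese letter such as U+1EC7 only fits in Windows-1258 as a base letter plus a combining accent, in a form that is neither NFC nor NFD.

```python
import unicodedata

# Greek in ISO-8859-7: ASCII keeps the low half, Greek letters sit in the high half.
greek = "αβγ"
greek_bytes = greek.encode("iso-8859-7")
one_byte_each = len(greek_bytes) == len(greek)
all_high_half = all(b >= 0x80 for b in greek_bytes)
ascii_untouched = "abc".encode("iso-8859-7") == b"abc"

# Vietnamese in Windows-1258: e-circumflex is one byte and the dot below is a
# separate combining byte, so this single letter takes two bytes.
cp1258_text = "\u00ea\u0323"  # e-circumflex + COMBINING DOT BELOW
cp1258_bytes = cp1258_text.encode("cp1258")

# The precomposed letter U+1EC7 cannot be encoded directly by this codec.
try:
    "\u1ec7".encode("cp1258")
    precomposed_fits = True
except UnicodeEncodeError:
    precomposed_fits = False

# A naive byte-for-code-point decode yields text that is neither NFC nor NFD.
decoded = cp1258_bytes.decode("cp1258")
nfc = unicodedata.normalize("NFC", decoded)
nfd = unicodedata.normalize("NFD", decoded)
is_nfc = decoded == nfc
is_nfd = decoded == nfd
```

Normalizing the decoded text to NFC is one way to get the usual precomposed letters back; whether that is the right thing to do depends on what the text is being fed into.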