On Sat, May 28, 2016, at 00:46, Rustom Mody wrote:
> Which also means that if the Chinese were to have more say in the
> design of Unicode/UTF-8 they would likely not waste swathes of prime
> real-estate for almost never used control characters just in the name
> of ASCII compliance
There are only 128 code points in the single-byte range of UTF-8, and only 32 of them (33 counting DEL) are used for control characters, almost-never-used or otherwise. What do you imagine they would have put there instead?

At least Unicode doesn't do as badly as the first-draft ISO UCS, which didn't allow a C0/C1 control value in *any* byte position in UCS-2 or UCS-4. As a result, UCS-2 would have encoded only 192*192 = 36,864 code points as two bytes (and the 64 control characters as one byte), as opposed to UTF-16's 63,488 two-byte characters (including all control characters).

For completeness, I'll note that conventional East Asian character encodings do have a higher information density than UTF-8, but at the cost of not being self-synchronizing. And their single-byte characters are in fact ASCII and C0/C1 controls, with only the Japanese Shift-JIS encodings additionally having Katakana as single-byte characters.
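
For anyone who wants to check the numbers, here is a rough Python sketch of the arithmetic and the density comparison above. It assumes the standard library codecs "gbk", "big5", "shift_jis" and "euc_jp" are reasonable stand-ins for "conventional East Asian character encodings"; the helper name byte_lengths and the sample strings are purely illustrative.

```python
# Sketch only: helper names and sample strings are illustrative assumptions.

# Code-point arithmetic from the post.
C0_C1_BYTES = 64                        # byte values 0x00-0x1F and 0x80-0x9F
allowed = 256 - C0_C1_BYTES             # byte values the draft UCS-2 could use
draft_ucs2_two_byte = allowed * allowed
utf16_two_byte = 0x10000 - 0x800        # BMP minus the 2,048 surrogates


def byte_lengths(text, codecs):
    """Return the encoded length of text, in bytes, for each codec name."""
    return {name: len(text.encode(name)) for name in codecs}


chinese = byte_lengths("中文", ("utf-8", "gbk", "big5"))
japanese = byte_lengths("日本語", ("utf-8", "shift_jis", "euc_jp"))

# Half-width katakana (here U+FF71) is a single byte in Shift-JIS.
halfwidth = byte_lengths("\uff71", ("utf-8", "shift_jis"))

# Self-synchronization: in Shift-JIS the trail byte of a two-byte
# character can fall in the ASCII range, so a decoder dropped into the
# middle of a stream cannot tell it apart from a real ASCII byte.
# UTF-8 continuation bytes are always 0x80-0xBF.
so_sjis = "ソ".encode("shift_jis")
so_utf8 = "ソ".encode("utf-8")
trail_in_ascii_sjis = any(b < 0x80 for b in so_sjis[1:])
trail_in_ascii_utf8 = any(b < 0x80 for b in so_utf8[1:])

if __name__ == "__main__":
    print("draft UCS-2 two-byte code points:", draft_ucs2_two_byte)
    print("UTF-16 two-byte code points:", utf16_two_byte)
    print("中文:", chinese)
    print("日本語:", japanese)
    print("half-width katakana:", halfwidth)
    print("Shift-JIS trail byte in ASCII range:", trail_in_ascii_sjis)
    print("UTF-8 trail byte in ASCII range:", trail_in_ascii_utf8)
```

The last two checks are the practical side of "not self-synchronizing": the Shift-JIS encoding of ソ ends in 0x5C, the same byte as an ASCII backslash, whereas no byte of a multi-byte UTF-8 sequence can be mistaken for ASCII.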