On Sat, 26 Aug 2017 18:55:25 +0300 Eli Zaretskii via Unicode <unicode@unicode.org> wrote:
> > Date: Sat, 26 Aug 2017 16:09:33 +0100
> > From: Richard Wordingham via Unicode <unicode@unicode.org>
> >
> > It shouldn't.  UTF-16 works just like UTF-8, except that the code
> > units are bigger.
>
> Not really, since UTF-8 doesn't have surrogates.

It has 115 surrogates, thoroughly oppressed by the UTC - there are 64
trailing surrogates 0x80 to 0xBF, 51 leading surrogates 0xC2 to 0xF4,
and 0xC0, 0xC1 and 0xF5 to 0xFF suffer the indignity of being the 13
uncodepoints - not even allowed in Unicode 8-bit strings.  Emacs is one
of the few systems that comes close to allowing them the dignity of
integer values of their own - 3FFF80₁₆ to 3FFFFF₁₆ for the code units
0x80 to 0xFF.

I well remember when Unicode regular expressions were required to allow
one to search for lone surrogates, but there was no corresponding
concept of looking for isolated or ill-associated bytes in Unicode
8-bit strings.

The point is that if one understands how UTF-8 works, UTF-16 is a
system that works using a subset of the same principles, and one should
therefore understand how UTF-16 works, until one comes to the weird and
dubious concept of surrogate points having properties.  I believe the
latter concept is of value only in code that lacks the concept of
gibberish.

In UTF-8, the distinction between code unit value and Unicode scalar
value is very clear; in UTF-16, it is muddied by the concept of
'codepoint'.

Richard.
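
The byte counts given above can be checked with a short sketch. It is only an illustration: the names classify and emacs_raw_byte_char are hypothetical, the category labels borrow the message's informal terms ("trailing" and "leading" surrogates, "uncodepoints") rather than any standard terminology, and the Emacs mapping simply restates the range quoted in the message.

```python
def classify(byte: int) -> str:
    """Classify one 8-bit code unit by the role it can play in UTF-8."""
    if byte <= 0x7F:
        return "ascii"      # a complete one-byte sequence
    if byte <= 0xBF:
        return "trailing"   # 0x80..0xBF: continuation byte
    if byte in (0xC0, 0xC1) or byte >= 0xF5:
        return "invalid"    # never occurs in well-formed UTF-8
    return "leading"        # 0xC2..0xF4: starts a multi-byte sequence


# Tally how many of the 256 possible code units fall in each class.
counts = {}
for b in range(256):
    kind = classify(b)
    counts[kind] = counts.get(kind, 0) + 1


def emacs_raw_byte_char(byte: int) -> int:
    """Integer in 0x3FFF80..0x3FFFFF for a raw byte 0x80..0xFF
    (assumption: a plain offset, matching the range in the message)."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("only bytes 0x80..0xFF have a raw-byte value")
    return 0x3FFF00 + byte


for kind in sorted(counts):
    print(kind, counts[kind])
print(hex(emacs_raw_byte_char(0x80)), hex(emacs_raw_byte_char(0xFF)))
```

The tally gives 64 trailing and 51 leading code units, 115 in all, plus the 13 that are never allowed; the remaining 128 are the ASCII range.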