On Mon, Jun 25, 2018 at 11:33 AM, Lee <ler...@gmail.com> wrote:
> I'm still trying to figure utf-8 out, but it seems to me that 0x0 -
> 0xff is part of the utf-8 encoding.

I don't see how you arrived at this. 0xFF is not the initial byte of any
valid UTF-8 byte sequence. It is also inconsistent with the statement
you make later:

> An easy way to remember this transformation format is to note that the
> number of high-order 1's in the first byte is the same as the number of
> subsequent bytes in the multibyte character:

This is roughly true (strictly, the count of high-order 1 bits equals the
total number of bytes in the sequence, lead byte included), but there is
also a zero bit that ends the run of high-order 1 bits, which means that
0xFF is not a valid lead byte. 0x7F is the highest byte value that you
can have as a single-byte UTF-8 string. Perhaps your statement about
0-0xFF was meant to be read differently.

Thomas Wolff's note seems to be objecting to the inclusion of characters
above U+10FFFF, which are not legal UTF-8 but were allowed in the
original proposal. Otherwise rows 1-4 of your table are correct.

The standards, such as IETF RFC 3629, are easy enough to read, so I
recommend using them and citing them to others instead of trying to
summarize.
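
For anyone who wants to check the lead-byte rule mechanically, here is a minimal Python sketch. The function names leading_ones and utf8_sequence_length are made up for illustration, and the byte ranges are the ones given in RFC 3629 as I read them. Note that the "count the leading 1 bits" rule alone would also accept 0xC0, 0xC1 and 0xF5-0xFD, which RFC 3629 excludes, so the second function checks ranges instead.

```python
def leading_ones(byte: int) -> int:
    """Count the high-order 1 bits of a byte (0..8)."""
    count = 0
    mask = 0x80
    while mask and byte & mask:
        count += 1
        mask >>= 1
    return count


def utf8_sequence_length(lead: int) -> int:
    """Total sequence length implied by a lead byte, or 0 if the
    byte cannot start a valid UTF-8 sequence (ranges per RFC 3629)."""
    if not 0 <= lead <= 0xFF:
        raise ValueError("not a byte value")
    if lead <= 0x7F:           # 0xxxxxxx: single byte, 0x7F is the maximum
        return 1
    if 0xC2 <= lead <= 0xDF:   # 110xxxxx (0xC0/0xC1 would only give overlong forms)
        return 2
    if 0xE0 <= lead <= 0xEF:   # 1110xxxx
        return 3
    if 0xF0 <= lead <= 0xF4:   # 11110xxx (0xF5 and up would exceed U+10FFFF)
        return 4
    return 0                   # 0x80-0xBF continuation bytes, 0xC0, 0xC1, 0xF5-0xFF


if __name__ == "__main__":
    for b in (0x7F, 0xC3, 0xE2, 0xF0, 0xFF):
        print(f"0x{b:02X}: leading ones = {leading_ones(b)}, "
              f"sequence length = {utf8_sequence_length(b)}")
```

0xFF has eight leading 1 bits and no terminating zero bit, so it comes out with a length of 0 (invalid). Python's own decoder agrees: bytes([0xFF]).decode("utf-8") raises UnicodeDecodeError.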