nick black left as an exercise for the reader:
> it's my understanding that Punycode's objective is to be "clean"
> with regards to things that match against the hostname character
> set, hence its pickup for IDN (where it's expected that DNS
> will be traversing all kinds of network middleware). a similar
> (obsolete) proposal was 7 bit-clean UTF-7.
it occurs to me that the properties of UTF-8 might not be in the forefront of everyone's minds. there are several good references to its properties and advantages [0] [1] [2]; i'll quote myself [3]:

Unicode Technical Report #17 defines seven official Unicode character encoding schemes: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE. What a wealth of encodings! How is one to choose? The -16BE and -16LE forms are simply UTF-16 with a known byte order; a UTF-16 stream can (optionally!) be prefixed with a Byte-Order Mark, at which point the stream reduces to -16LE or -16BE (in the absence of a BOM, the best advice is to follow your heart). UTF-32 breaks down the same way. This question of endianness arises from the fact that UTF-16 and UTF-32 are coded in terms of 16- and 32-bit units. UTF-8, being coded in terms of individual bytes, has no need to define byte order.

“Well, that BOM sounds kinda annoying,” I hear you asking. “What other advantages are offered by UTF-8?” Remember how ANSI X3.4-1986 maps precisely to the first 128 characters of UCS? UTF-8 (and only UTF-8, of the official encodings) encodes these 128 characters the same as US-ASCII! Boom! Every ASCII document you have—including most source code, configuration files, system files, etc.—is a perfectly valid UTF-8 document.

Furthermore, UTF-8 never encodes non-ASCII characters to the ASCII bytes. So an arbitrary UTF-8 document may have plenty of high-bit bytes that your ASCII-aware, POSIX-locale program doesn’t understand, but it never sees a valid ASCII character where one wasn’t intended. UTF-8 encodes ASCII’s 0–0x7f to 0–0x7f, and otherwise never produces a byte in that range. This includes the all-important null character 0—Boom! Every ASCII C string is already a valid UTF-8 string, and UTF-8 never introduces an embedded zero byte, so every UTF-8 string can be passed through standard C APIs cleanly, and they’ll more or less work. It’s furthermore self-synchronizing: if you pick up a UTF-8 stream in the middle, you know after reading a single byte whether you’re in the middle of a multibyte character.

“Sweet! What’s the catch? Does it waste space?” RFC 3629 limits UTF-8’s range to the 17 ∗ 2^16-ary code space of UCS, in which case the maximum length of a single UTF-8-encoded UCS code point is four bytes [4]. It’s thus always at least as efficient as UTF-32. When ASCII characters are used, UTF-8 is more efficient than either UTF-16 or UTF-32. Only for streams utterly dominated by BMP codepoints requiring three bytes in UTF-8 can UTF-16 encode more efficiently.

“Sweet! What’s the catch? Is it super slow?” UTF-32, it is true, allows you to index into a string by character in O(1) (UTF-16 does not, unless you’re only dealing with BMP strings). UTF-32 also allows you to compute the bytes necessary for encoding in O(1), given the number of Unicode codepoints, but that’s only because it’s wasteful; if you’re willing to be similarly wasteful, you can do the same calculation with UTF-8 (and then trim any wastage at the end, if you wish). Any advantage UTF-32 might hold in lexing simplicity is likely a wash when UTF-8’s usual space efficiency is taken into account, owing to more effective use of cache and memory bandwidth. Nope, it’s not slow.
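to make those byte-level guarantees concrete, here's a rough encoder sketch (illustrative only; lean on your libc's wctomb(3) or a proper Unicode library for real work). note that ASCII code points come out as single, identical bytes, that every byte of a multibyte sequence has its high bit set, and that nothing beyond U+10FFFF (nor the surrogate range) gets encoded at all:

  #include <stddef.h>
  #include <stdint.h>

  // encode a single Unicode scalar value as UTF-8 into buf (>= 4 bytes).
  // returns the number of bytes written, or 0 for surrogates and anything
  // past RFC 3629's U+10FFFF ceiling. illustrative sketch, not library code.
  static size_t utf8_encode(uint32_t cp, unsigned char *buf){
    if(cp < 0x80){                 // ASCII: one byte, identical to US-ASCII
      buf[0] = cp;
      return 1;
    }else if(cp < 0x800){          // 110xxxxx 10xxxxxx
      buf[0] = 0xc0 | (cp >> 6);
      buf[1] = 0x80 | (cp & 0x3f);
      return 2;
    }else if(cp < 0x10000){        // 1110xxxx 10xxxxxx 10xxxxxx
      if(cp >= 0xd800 && cp <= 0xdfff){
        return 0;                  // UTF-16 surrogates aren't scalar values
      }
      buf[0] = 0xe0 | (cp >> 12);
      buf[1] = 0x80 | ((cp >> 6) & 0x3f);
      buf[2] = 0x80 | (cp & 0x3f);
      return 3;
    }else if(cp <= 0x10ffff){      // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
      buf[0] = 0xf0 | (cp >> 18);
      buf[1] = 0x80 | ((cp >> 12) & 0x3f);
      buf[2] = 0x80 | ((cp >> 6) & 0x3f);
      buf[3] = 0x80 | (cp & 0x3f);
      return 4;
    }
    return 0;                      // beyond the seventeen planes
  }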
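self-synchronization is similarly mechanical. a continuation byte always matches 10xxxxxx, and no lead byte (nor any ASCII byte) does, so finding the start of the character enclosing an arbitrary offset is a bounded backward scan; a sketch, assuming the buffer holds valid UTF-8:

  #include <stddef.h>

  // back up from an arbitrary offset to the start of the enclosing UTF-8
  // character. at most three steps are needed for valid UTF-8, since a
  // sequence has at most three continuation bytes. illustrative sketch only.
  static size_t utf8_sync(const unsigned char *buf, size_t off){
    while(off > 0 && (buf[off] & 0xc0) == 0x80){
      --off; // continuation byte; keep backing up
    }
    return off;
  }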
*Always interoperate in UTF-8 by default.* UTF-16 is some truly stupid shit, fit only for jabronies. It only ever passed muster because people thought UCS was going to be a sixteen-bit character set. The moment a second Plane was added, UTF-16 ought to have been shown the door. There’s an argument to be made for ripping it from the pages of books in your local library. If you must work on a UTF-16 system, use UTF-16 at the boundary, and then keep it around as UTF-32 or UTF-8. Always interoperate—including writing files—in UTF-8 by default.

There are a dozen-odd similarly-named encodings which are useful for nothing but trivia. UCS-2 was UTF-16 restricted to the BMP (no surrogate pairs). UCS-4 is just UTF-32. UTF-7 is a seven-bit-clean UTF-8 [5]. UTF-1 is UTF-8’s older, misshapen sister, locked away from sight in the attic. UTF-5 and UTF-6 were proposed encodings for IDN, but Punycode was selected instead. WTF-8 extends UTF-8 to round-trip the unpaired surrogates of invalid UTF-16 input. BOCU-1 and SCSU are compressing encodings that don’t compress as well as gzipped UTF-8. UTF-9 and UTF-18 were jokes. Is UTF-EBCDIC a thing? Of course UTF-EBCDIC is a thing.

The one place where you won’t interoperate with UTF-8 is domain name lookup, when converting IDNA into the LDH (letter-digit-hyphen) subset of ASCII. If you’re interested, consult RFC 3492, and Godspeed.

--rigorously, nick

[0] https://research.swtch.com/utf8
[1] https://utf8everywhere.org/
[2] https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
[3] https://nick-black.com/htp-notcurses.pdf, pp. 52–53
[4] You might hear six bytes, and indeed ISO/IEC 10646 specifies six bytes to handle up through U+7FFFFFFF…but it only defines UCS to cover 17 planes. Verify that your wctomb(3) rejects inputs in excess of 0x10ffff before exploiting RFC 3629’s tighter bound.
[5] The primary seven-bit-clean medium of the modern era is probably email sent without a MIME transfer encoding.

--
nick black -=- https://nick-black.com
to make an apple pie from scratch, you need first invent a universe.