nick black left as an exercise for the reader:
> it's my understanding that Punycode's objective is to be "clean"
> with regards to things that match against the hostname character
> set, hence its pickup for IDN (where it's expected that DNS
> will be traversing all kinds of network middleware). a similar
> (obsolete) proposal was 7 bit-clean UTF-7.

it occurs to me that the properties of UTF-8 might not be in the
forefront of everyone's minds. there are several good references
to its properties and advantages [0] [1] [2]; i'll quote myself
[3]:

Unicode Technical Report #17 defines seven official Unicode
character encoding schemes: UTF-8, UTF-16, UTF-16BE, UTF-16LE,
UTF-32, UTF-32BE, and UTF-32LE. What a wealth of encodings! How
is one to choose? The -16BE and -16LE forms are simply UTF-16
with a known byte order; a UTF-16 stream can (optionally!) be
prefixed with a Byte-Order Mark, at which point the stream
reduces to -16LE or -16BE (in the absence of a BOM, the best
advice is to follow your heart). UTF-32 breaks down the same
way. This question of endianness arises from the fact that
UTF-16 and UTF-32 are coded in terms of 16- and 32-bit units.
UTF-8, being coded in terms of individual bytes, has no need to
define byte order.
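
To make the BOM-sniffing dance concrete, here's a minimal C sketch
(illustrative only; note that UTF-32LE's mark FF FE 00 00 begins
with UTF-16LE's FF FE, so the longer patterns must be checked
first):

    #include <stddef.h>

    /* guess the encoding scheme from a stream's leading bytes.
     * returns NULL when no BOM is present (follow your heart). */
    static const char *sniff_bom(const unsigned char *buf, size_t len){
      if(len >= 4 && buf[0] == 0x00 && buf[1] == 0x00
                  && buf[2] == 0xFE && buf[3] == 0xFF){
        return "UTF-32BE";
      }
      if(len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE
                  && buf[2] == 0x00 && buf[3] == 0x00){
        return "UTF-32LE";
      }
      if(len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF){
        return "UTF-16BE";
      }
      if(len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE){
        return "UTF-16LE";
      }
      if(len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF){
        return "UTF-8"; /* a pointless but legal UTF-8 signature */
      }
      return NULL;
    }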

“Well, that BOM sounds kinda annoying,” I hear you asking. “What
other advantages are offered by UTF-8?” Remember how ANSI
X3.4-1986 maps precisely to the first 128 characters of UCS?
UTF-8 (and only UTF-8, of the official encodings) encodes these
128 characters the same as US-ASCII! Boom! Every ASCII document
you have—including most source code, configuration files, system
files, etc.—is a perfectly valid UTF-8 document. Furthermore,
UTF-8 never encodes non-ASCII characters to the ASCII bytes. So
an arbitrary UTF-8 document may have plenty of high-bit bytes
that your ASCII-aware, POSIX-locale program doesn’t understand,
but it never sees a valid ASCII character where one wasn’t
intended. UTF-8 encodes ASCII’s 0–0x7f to 0–0x7f, and otherwise
never produces a byte in that range. This includes the
all-important null character 0—Boom! Every nul-terminated C
string is a valid UTF-8 string. Every UTF-8 string can be passed
through standard C APIs cleanly, and they’ll more or less work.
It’s furthermore self-synchronizing. If you pick up a UTF-8
stream in the middle, you know after reading a single byte
whether you’re in the middle of a multibyte character.
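
The self-synchronization property falls out of the bit layout:
continuation bytes always match 10xxxxxx, and lead bytes never do.
A hypothetical resynchronization helper in C (a sketch, assuming
well-formed input):

    #include <stddef.h>

    /* advance p to the next character boundary (or to end) by
     * skipping continuation bytes, each of the form 10xxxxxx. */
    static const unsigned char *
    utf8_sync(const unsigned char *p, const unsigned char *end){
      while(p < end && (*p & 0xC0) == 0x80){
        ++p;
      }
      return p;
    }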

“Sweet! What’s the catch? Does it waste space?” RFC 3629
limits UTF-8’s range to the 17 ∗ 2^16-ary code space of UCS, in
which case the maximum length of a single UTF-8-encoded UCS code
point is four bytes [4]. It’s thus never less efficient than
UTF-32. Wherever ASCII characters appear, UTF-8 is more efficient
than either UTF-16 or UTF-32. Only for streams utterly dominated
by BMP codepoints requiring three bytes in UTF-8 (U+0800 through
U+FFFF) can UTF-16 encode more efficiently.
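
The per-codepoint costs, following RFC 3629's ranges (an
illustrative sketch, not a complete encoder):

    #include <stdint.h>

    static unsigned utf8_bytes(uint32_t cp){
      if(cp < 0x80) return 1;    /* ASCII */
      if(cp < 0x800) return 2;   /* up through e.g. Greek, Cyrillic */
      if(cp < 0x10000) return 3; /* the rest of the BMP */
      return 4;                  /* astral planes through U+10FFFF */
    }

    static unsigned utf16_bytes(uint32_t cp){
      return cp < 0x10000 ? 2 : 4; /* surrogate pair past the BMP */
    }

    static unsigned utf32_bytes(uint32_t cp){
      (void)cp;
      return 4; /* always, wastefully */
    }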

“Sweet! What’s the catch? Is it super slow?” UTF-32, it is true,
allows you to index into a string by character in O(1) (UTF-16
does not, unless you’re only dealing with BMP strings). UTF-32
also allows you to compute the bytes necessary for encoding in
O(1), given the number of Unicode codepoints, but that’s only
because it’s wasteful; if you’re willing to be similarly
wasteful, you can do the same calculation with UTF-8 (and then
trim any wastage at the end, if you wish). Any advantage UTF-32
might hold in lexing simplicity is likely a wash when UTF-8’s
usual space efficiency is taken into account, owing to more
effective use of cache and memory bandwidth. Nope, it’s not
slow. *Always interoperate in UTF-8 by default.*
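
For a feel of why the O(1)-indexing argument rarely matters,
consider counting the codepoints in a UTF-8 buffer: count the
bytes which are not continuation bytes. It's O(n), but it's a
tight loop over a compact buffer (a sketch, assuming valid
UTF-8):

    #include <stddef.h>

    static size_t utf8_codepoints(const unsigned char *buf, size_t len){
      size_t count = 0;
      for(size_t i = 0 ; i < len ; ++i){
        if((buf[i] & 0xC0) != 0x80){ /* lead byte or ASCII */
          ++count;
        }
      }
      return count;
    }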

UTF-16 is some truly stupid shit, fit only for jabronies. It
only ever passed muster because people thought UCS was going to
be a sixteen-bit character set. The moment a second Plane was
added, UTF-16 ought to have been shown the door. There’s an
argument to be made for ripping it from the pages of books in
your local library. If you must work on a UTF-16 system, use
UTF-16 at the boundary, and then keep it around as UTF-32 or
UTF-8. Always interoperate—including writing files—in UTF-8 by
default.
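
Decoding at that boundary is mechanical; per RFC 2781, a high
surrogate (D800–DBFF) and a low surrogate (DC00–DFFF) combine as
in this sketch:

    #include <stdint.h>

    /* combine a UTF-16 surrogate pair into a UCS codepoint.
     * assumes hi and lo were already validated as surrogates. */
    static uint32_t utf16_pair_to_cp(uint16_t hi, uint16_t lo){
      return 0x10000 + (((uint32_t)(hi - 0xD800) << 10)
                        | (uint32_t)(lo - 0xDC00));
    }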

There are a dozen-odd similarly-named encodings which are useful
for nothing but trivia. UCS-2 was UTF-16, but for only the BMP.
UCS-4 is just UTF-32. UTF-7 is a seven-bit-clean UTF-8 [5]. UTF-1
is UTF-8’s older, misshapen sister, locked away from sight in
the attic. UTF-5 and UTF-6 were proposed encodings for IDN, but
Punycode was selected instead. WTF-8 extends UTF-8 to handle
invalid UTF-16 input. BOCU-1 and SCSU are compressing encodings
that don’t compress as well as gzipped UTF-8. UTF-9 and UTF-18
were jokes. Is UTF-EBCDIC a thing? Of course UTF-EBCDIC is a
thing.

The one place where you won’t interoperate with UTF-8 is for
domain name lookup, when converting IDNA into the LDH subset of
ASCII. If you’re interested, consult RFC 3492, and Godspeed.
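
If you have libidn2 handy (an assumption; link with -lidn2), the
whole affair wraps up in a call to its idn2_to_ascii_8z(), which
converts a UTF-8 hostname to the xn-- ACE form of RFC 3492:

    #include <stdio.h>
    #include <stdlib.h>
    #include <idn2.h>

    int main(void){
      char *ace = NULL; /* ASCII Compatible Encoding (xn-- form) */
      int rc = idn2_to_ascii_8z("b\u00fccher.example", &ace, 0);
      if(rc != IDN2_OK){
        fprintf(stderr, "idn2: %s\n", idn2_strerror(rc));
        return EXIT_FAILURE;
      }
      printf("%s\n", ace); /* expect xn--bcher-kva.example */
      free(ace);
      return EXIT_SUCCESS;
    }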

--rigorously, nick

[0] https://research.swtch.com/utf8
[1] https://utf8everywhere.org/
[2] https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
[3] https://nick-black.com/htp-notcurses.pdf p52-53
[4] You might hear six bytes, and indeed ISO/IEC 10646 specifies
    six bytes to handle up through U+7FFFFFFF…but only defines UCS
    to cover 17 planes. Verify your wctomb(3) rejects inputs in
    excess of 0x10ffff before exploiting RFC 3629’s tighter bound.
[5] The primary seven-bit-clean medium of the modern era is
    probably email sent without a MIME transfer encoding.

-- 
nick black -=- https://nick-black.com
to make an apple pie from scratch,
you need first invent a universe.
