>> What does POSIX say?
> [...]
> 2. Each byte in the UTF-8 encoding is interpreted as ASCII
As soon as any of the input codepoints is non-ASCII, UTF-8 generates
octets which are outside the ASCII range and thus cannot be interpreted
as ASCII (at least not without further processing).

> 3. If there's a matching character, use that one.  If not, insert a
>    hex encoding of the byte.

Provided the "If not" case covers the case where the octet isn't ASCII,
this is then well-defined...provided manpage names are taken as
sequences of characters.

> AFAIK the DNS now uses "Punycode" to encode non-ASCII Unicode
> characters in domain names.  It's confusing and likely has no
> advantage over a straight-up hex encoding.

I think it does, for its design use case.

The major advantage I see is that it's more compact.  Hex encoding
doubles the octet count, or triples it with the % prefix.  To compare
fairly with punycode you first have to convert the Unicode codepoint
string into an octet string.  Assuming this is done with UTF-8, the
intermediate string has two to four times as many octets as there are
codepoints in the original string (counting only the non-ASCII
characters, of course; RFC 3629 caps UTF-8 at four octets per
codepoint).  This count is then doubled or tripled, leading to at
least four and possibly as many as 12 output octets per (non-ASCII)
input codepoint.  Since there is a small maximum (63 octets) on DNS
label length, this degree of expansion is undesirable.

Punycode is substantially more compact.  See the examples in RFC 3492.

I am actually somewhat surprised they didn't just specify use of
UTF-8.  The DNS supports all 256 possible octet values in labels,
except that there is the historical misfeature that 26 of them (the
ASCII uppercase letters) are treated as identical to a different 26
(their lowercase counterparts).  I see no particular reason to not
just use UTF-8 labels.  Presumably they had some, but if it's in
RFC 3492 then my reading missed it.

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mo...@rodents-montreal.org
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
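For a rough feel for the expansion argument above, here is a minimal
Python sketch that counts octets for one label under each scheme.  The
function name octet_counts and the choice to hex-escape only the
non-ASCII octets are my own assumptions for illustration, not anything
taken from POSIX or the RFCs.  The "punycode" codec is the one shipped
in Python's standard library; it emits the bare RFC 3492 encoding, so
the 4-octet "xn--" prefix used in real DNS labels is not counted.

```python
def octet_counts(label):
    """Return octet counts for `label` under each encoding discussed.

    Hypothetical helper for illustration only.  ASCII octets are assumed
    to pass through unchanged; only non-ASCII octets are hex-escaped.
    """
    utf8 = label.encode("utf-8")
    return {
        "codepoints": len(label),
        "utf8": len(utf8),
        # Two hex digits per non-ASCII octet, no prefix.
        "hex": sum(1 if b < 0x80 else 2 for b in utf8),
        # '%' plus two hex digits per non-ASCII octet.
        "percent_hex": sum(1 if b < 0x80 else 3 for b in utf8),
        # Bare Punycode, without the "xn--" ACE prefix.
        "punycode": len(label.encode("punycode")),
    }


if __name__ == "__main__":
    # One mostly-ASCII label and one entirely non-ASCII label.
    for label in ("b\u00fccher", "\u65e5\u672c\u8a9e"):
        print(label, octet_counts(label))
```

For a mostly-ASCII label the schemes come out close to one another; the
gap opens up when most of the label is non-ASCII, which is the case
where the 63-octet limit starts to bite.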