On Mon, Nov 08, 2021 at 03:30:14PM -0500, Mouse wrote:
> >> What does POSIX say?
> > [...]
> > 2. Each byte in the UTF-8 encoding is interpreted as ASCII
>
> As soon as any of the input codepoints are non-ASCII, UTF-8 generates
> octets which are outside the ASCII range and thus cannot be interpreted
> as ASCII (at least not without further processing).
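
The quoted point is easy to check by hand. Below is a minimal sketch in
Python (the helper names utf8_octets and is_ascii_octet are made up for
this illustration) that encodes U+0025 and U+FF05 and looks at the
resulting octets:

```python
def utf8_octets(text: str) -> list[int]:
    """Return the UTF-8 encoding of text as a list of octet values."""
    return list(text.encode("utf-8"))


def is_ascii_octet(octet: int) -> bool:
    """ASCII covers octet values 0x00 through 0x7F only."""
    return octet < 0x80


# U+0025 PERCENT SIGN is ASCII; U+FF05 FULLWIDTH PERCENT SIGN is not.
ascii_percent = utf8_octets("\u0025")
fullwidth_percent = utf8_octets("\uff05")

print([hex(o) for o in ascii_percent])       # ['0x25']
print([hex(o) for o in fullwidth_percent])   # ['0xef', '0xbc', '0x85']

# None of the octets of the non-ASCII codepoint falls in the ASCII range.
print(any(is_ascii_octet(o) for o in fullwidth_percent))  # False
```

Every octet that UTF-8 produces for a non-ASCII codepoint has its high
bit set, so none of them can be mistaken for an ASCII character.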

As I've had to deal with UTF-8 in UDF, I'd say it's not a big deal.
There is AFAIK no possibility for confusion with ASCII; only the string
length calculation can go wrong. As Unicode has support for glyphs next
to characters, it offers look-alike codepoints such as U+FF05 (fullwidth
percent sign) for U+0025 (%). See
https://www.compart.com/en/unicode/category/Po
Some of these replacements could be used in filenames; though not
really typeable, they are readable and obvious.

> > 3. If there's a matching character, use that one. If not, insert a
> > hex encoding of the byte.
>
> Provided the "If not" case covers the case where the octet isn't ASCII,
> this is then well-defined...provided manpage names are taken as
> sequences of characters.
>
> > AFAIK the DNS now uses "Punycode" to encode non-ASCII Unicode
> > characters in domain names. It's confusing and likely has no
> > advantage over a straight-up hex encoding.

It's smaller, for one; it's more readable than &=0xF3; stuff.

> The major advantage I see is that it's more compact; hex encoding
> doubles or, with the % prefix, triples, octet count, and to compare
> fairly with punycode you have to first convert the Unicode codepoint
> string into an octet string; assuming this is done with UTF-8, it leads
> to two to five times as many octets in the intermediate string as there
> are codepoints in the original string (counting only the non-ASCII
> characters, of course). This count is then doubled or tripled, leading
> to at least four and possibly as many as 15 output octets per
> (non-ASCII) input codepoint. Since there is a small maximum - 63 - on
> DNS label length, this degree of expansion is undesirable.
>
> Punycode is substantially more compact. See the examples in RFC3492.
>
> I am actually somewhat surprised they didn't just specify use of UTF-8.
> The DNS supports all 256 possible octet values in labels, except that
> there is the historical misfeature that 26 of them are treated as
> identical to a different 26.
> I see no particular reason to not just
> use UTF-8 labels. Presumably they had some, but if it's in 3492 then
> my reading missed it.

Indeed, UTF-8 would have sufficed IMHO.

Reinoud
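
To put rough numbers on the size argument quoted above, here is a
minimal sketch in Python using the standard library's "punycode" codec.
The helper names, the sample string and the choice of %XX as the hex
form are assumptions made for this illustration, and the nameprep step
that real IDNA applies before Punycode is skipped:

```python
DNS_LABEL_MAX = 63  # maximum DNS label length in octets


def percent_hex_length(label: str) -> int:
    """Length after UTF-8 encoding with each non-ASCII octet written as %XX."""
    total = 0
    for octet in label.encode("utf-8"):
        # ASCII octets stay as one character; others expand to three.
        total += 1 if octet < 0x80 else 3
    return total


def punycode_length(label: str) -> int:
    """Length of the ACE form: the 'xn--' prefix plus the Punycode output."""
    return len("xn--") + len(label.encode("punycode"))


# Hypothetical sample label: nine CJK codepoints, no ASCII at all.
sample = "\u4ed6\u4eec\u4e3a\u4ec0\u4e48\u4e0d\u8bf4\u4e2d\u6587"

print("codepoints:   ", len(sample))
print("UTF-8 octets: ", len(sample.encode("utf-8")))
print("%XX length:   ", percent_hex_length(sample))
print("xn-- length:  ", punycode_length(sample))
print("%XX fits in a label: ", percent_hex_length(sample) <= DNS_LABEL_MAX)
print("xn-- fits in a label:", punycode_length(sample) <= DNS_LABEL_MAX)
```

For this sample the 9 codepoints become 27 UTF-8 octets and 81
characters in %XX form, which is already over the 63-octet label limit,
while the Punycode form stays well under it. Raw UTF-8 at 27 octets
would fit as well.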