Strake dixit:

>Use wchar.h functions and a sane libc, e.g. musl, which has a pure
>UTF-8 C locale, which ISO C explicitly allows [1].
>
>The 8-bit clarity what POSIX wants [1] seems nonsense to me, as one
>can use byte functions for that, but I may be wrong.
 ^^^^^^^^^^^^^^^^^^^^^^

Not always, see below.
>[1] http://wiki.musl-libc.org/wiki/Functional_differences_from_glibc

MirBSD has exactly one “locale” (just enough to satisfy POSIX), and
it’s pure UTF-8 (with a 16-bit wchar_t though) but 8-bit clean. This
was a requirement from the start.

Imagine this: txtfile and binfile are, respectively, a plain text
UTF-8 file and a binary file (say, an ELF object). “with*locale” is a
placeholder to set the respective LC_* settings or something.

$ withClocale tr x x <txtfile >txtfile2
$ withUTF8locale tr x x <txtfile >txtfile3
$ withClocale tr x x <binfile >binfile2
$ withUTF8locale tr x x <binfile >binfile3

The output of this, when using a character-aware tr(1), will be:

• txtfile2 and txtfile3 will be identical to txtfile
• binfile2 will be identical to binfile
• binfile3 will be 0 bytes long, and the system will have thrown
  EILSEQ, because the binary file contains sequences that are not
  conforming UTF-8; this is actually *required* and *correct*, and
  the reason Debian has introduced (on my prodding) a “C.UTF-8”
  locale, which is just the same as “C” except with UTF-8 encoding,
  and _always_ installed.

Now, on a system with multiple locales, you can just set the
appropriate locale when dealing with files you know are binary or
UTF-8 text. If you know. But if your “C” locale is UTF-8, you
absolutely lose the ability to operate the standard Unix utilities
on nōn-UTF-8 files (or, for example, files with mixed encoding).
Hilarity ensues (such as nvi in Debian trashing files *on save*,
with no warning before and no method to revert) with such files in
UTF-8 encodings.

You cannot just “use the byte functions” because, for example, you
want to use tr(1), or you want to use your favourite editor on a
file that’s “mostly” UTF-8 but contains some “raw octets”; the
script I use in MirBSD to convert catmanpages to HTML is such an
example, because these octets (e.g. \xFE and \xFF) are used as
separators for sed(1) calls, or placeholders.

I hope to have sufficiently shown my case.
Now, as for the solution, as first appeared in MirBSD:

I have invented a scheme whereby, upon conversion between 8-bit
(multibyte) and (in MirBSD) 16-bit (wide) character{s, strings},
every input that is not well-formed UTF-8 (e.g. \xC2 \x80, \xFF) is
mapped into a 128-codepoint range in the Private Use Area, and upon
conversion back to multibyte, mapped back appropriately. Well-formed
UTF-8 on the multibyte side that corresponds to one of these 128
codepoints is *also* taken as invalid, in order to guarantee
round-tripping; one should not have been storing PUA characters in
files in the first place *and* expect to be able to manipulate them
on every OS too. (Just round-tripping works, but e.g.
tr $'\uEF80' $'\uEF81' will not work.)

I had tentatively assigned a 128-codepoint PUA range for this, but
then contacted the ConScript Unicode Registry, which is a voluntary
agreement to reserve each other’s ranges, and asked for a
128-codepoint assignment there (and got one, and also registered
some other PUA users from Linux and Apple with them).

MirBSD now maps \x80 to (wchar_t)0xEF80, \x81 to (wchar_t)0xEF81,
etc. up to \xFF to (wchar_t)0xEFFF, when converting from multibyte
to wide characters; additionally, the octet sequences ranging from
\xEE\xBE\x80 to \xEE\xBF\xBF, which are (strictly speaking) valid
UTF-8, are mapped to L"\uEFEE\uEFBE\uEF80" – L"\uEFEE\uEFBF\uEFBF".

Python 3 later had the same problem, and solved it in the same way,
although using a different range. Their system uses “Unicode”
strings throughout, which is fancy for wchar_t strings (and probably
more portable than the C equivalent), but one needs to be able to,
for example, call the open(2) equivalent with arbitrary 8-bit
filenames (as POSIX filenames are octet strings except NUL and
slash). Their mapping is defined in PEP 383, and instead of using
the PUA they map to the “second” half of UTF-16 surrogates, arguïng
that surrogates never occur unpaired in valid Unicode sequences.
(I disagreed because this makes the encoding stateful again, which
was one major benefit of Unicode to not have. Martin decided not to
switch to the MirBSD PUA mapping; we agreed to disagree there.
Additionally, at the current point in time, MirBSD is still
(deliberately – to keep the implementation small and suckless)
“confined” to the Unicode BMP, i.e. does n̲o̲t̲ use UTF-16 anyway, so
we could not use their definition.)

The implementation mostly consists of one macro, two files, and a
type definition (which is inspired by Bruno Haible’s libutf8):

#define iswoctet(wc) (((wchar_t)(wc) & 0xFF80) == 0xEF80)

https://www.mirbsd.org/cvs.cgi/src/kern/c/optu8to16.c?rev=HEAD
https://www.mirbsd.org/cvs.cgi/src/kern/c/optu16to8.c?rev=HEAD

typedef struct {
	unsigned int count:2;
	unsigned int value:12;
} __attribute__((__packed__)) mbstate_t;

Implementation note:
│ typedef short unsigned int wchar_t;
│ typedef unsigned int wint_t;
(This is not a requirement.)

The files are equivalents of wcrtomb (optu16to8; actually, in MirBSD
these two functions are precisely the same) and mbrtowc (optu8to16,
but with a small API difference, so this is not a drop-in
replacement, for a reason), respectively. When calling optu8to16 one
byte at a time, it will store info in the mbstate_t argument just
like mbrtowc does, but upon encountering some nōn-UTF-8 sequences it
needs to be able to emit _several_ wide characters _without_ eating
up any octets; this makes for the API difference:

mbrtowc                         │optu8to16
════════════════════════════════╪═══════════════════════════════════════
-                               │When called with n == 0 it can still
                                │emit up to two wide characters, and
                                │will only thereafter return (size_t)-2.
────────────────────────────────┼───────────────────────────────────────
The s == NULL check is first.   │The check for n == 0 comes before the
                                │check for s == NULL so a caller can
                                │flush the state w/o needing to pass
                                │an input string.
────────────────────────────────┼───────────────────────────────────────
A return value of 0 means that  │Any return value not (size_t)-2 or -1
the most recent character was   │means “number of input octets eaten”.
a NUL character and terminates  │This implies a return value of 0 means
the string. Any other return    │“no input eaten but output was still
value not (size_t)-2 or -1 (or, │emitted”.
in the C11 case, (size_t)-3,    │
which does not work with 16-bit │
wchar_t though) means “number   │
of input octets eaten”.         │

When calling mbrtowc and/or optu8to16 with enough input octets to
always form output (be the input UTF-8 or raw octets), there will be
no difference; otherwise, calling mbrtowc() will lose the second or
third byte of multi-byte invalid input. This is the only known issue
with this solution, *and* it will require some code to be patched
(to either always pass “enough” bytes or use the optu8to16 function
if present, e.g. from an autoconf check).

“Enough” bytes has an upper bound: 3 octets of input are, in a̲l̲l̲
cases, enough to determine whether any given input is valid UTF-8 or
whether a PUA mapping needs to be emitted. (This is for 16-bit
Unicode; for 21-bit Unicode you would want to pass at least 4 octets
(and declare the 5- and 6-byte forms of UTF-8 invalid input).) In
most cases, 1 or 2 octets will be enough, but 3 will always work (in
the “invalid input” case, optu8to16 will just “eat” only one byte,
making a sliding 3-byte window).

The functions given are under a liberal enough licence (MirOS, like
MIT), but should this be a problem, talk to me directly. I still
reserve the right to “steer” the “OPTU-8” and “OPTU-16” “encodings”
(which is just a fancy way of saying “UTF-8/CESU-8 but raw octets
allowed” and “UCS-2 with specific meaning for the PUA area
[EF80;EFFF]”; nl_langinfo(CODESET) returns "UTF-8" on MirBSD of
course).

This all has been extensively, and in some cases empirically (i.e.
with all possible input and state values), tested, but only with the
constraint of supporting 16-bit Unicode.

(Note that the wchar_t values 0xFFFE and 0xFFFF are reserved for the
(size_t)-1 and (size_t)-2 results, and thus their corresponding
UTF-8 encoding is considered invalid; but then, e̲v̲e̲r̲y̲ system with a
16-bit wchar_t m̲u̲s̲t̲ do the same; those just would need to throw
EILSEQ instead of passing it through transparently.)

HTH & HAND,
//mirabilos
-- 
13:37⎜«Natureshadow» Deep inside, I hate mirabilos. I mean, he's a
good guy. But he's always right! In every fsckin' situation, he's
right. Even with his deeply perverted taste in software and borked
ambition towards broken OSes - in the end, he's damn right about it
:(! […] works in mksh