Strake dixit:

>Use wchar.h functions and a sane libc, e.g. musl, which has a pure
>UTF-8 C locale, which ISO C explicitly allows [1].
>
>The 8-bit clarity what POSIX wants [1] seems nonsense to me, as one
>can use byte functions for that, but I may be wrong.
 ^^^^^^^^^^^^^^^^^^^^^^

Not always, see below.
>[1] http://wiki.musl-libc.org/wiki/Functional_differences_from_glibc

MirBSD has exactly one “locale” (just enough to satisfy POSIX), and
it’s pure UTF-8 (with a 16-bit wchar_t though) but 8-bit clean. This
was a requirement from the start.

Imagine this: txtfile and binfile are, respectively, a plain text
UTF-8 file and a binary file (say, an ELF object). “with*locale” is a
placeholder to set the respective LC_* settings or something.

$ withClocale tr x x <txtfile >txtfile2
$ withUTF8locale tr x x <txtfile >txtfile3
$ withClocale tr x x <binfile >binfile2
$ withUTF8locale tr x x <binfile >binfile3

The output of this, when using a character-aware tr(1), will be:

• txtfile2 and txtfile3 will be identical to txtfile
• binfile2 will be identical to binfile
• binfile3 will be 0 bytes long, and the system will have thrown
  EILSEQ, because the binary file contains sequences that are not
  conforming UTF-8; this is actually *required* and *correct*, and
  the reason Debian has introduced (on my prodding) a “C.UTF-8”
  locale, which is just the same as “C” except with UTF-8 encoding,
  and _always_ installed.

Now, on a system with multiple locales, you can just set the
appropriate locale when dealing with files you know are binary or
UTF-8 text. If you know. But if your “C” locale is UTF-8, you
absolutely lose the ability to operate the standard Unix utilities
on nōn-UTF-8 files (or, for example, files with mixed encoding).
Hilarity ensues (such as nvi in Debian trashing files *on save*,
with no warning before and no method to revert) with such files in
UTF-8 encodings.

You cannot just “use the byte functions” because, for example, you
want to use tr(1), or you want to use your favourite editor on a
file that’s “mostly” UTF-8 but contains some “raw octets”; the
script I use in MirBSD to convert catmanpages to HTML is such an
example, because these octets (e.g. \xFE and \xFF) are used as
separators for sed(1) calls, or placeholders.

I hope to have sufficiently shown my case.
Now, as for the solution, as first appeared in MirBSD:

I have invented a scheme whereby, upon conversion between 8-bit
(multibyte) and (in MirBSD) 16-bit (wide) character{s, strings},
every input that is not well-formed UTF-8 (e.g. \xC2 \x80, \xFF) is
mapped into a 128-codepoint range in the Private Use Area, and upon
conversion back to multibyte, mapped back appropriately. Well-formed
UTF-8 on the multibyte side that corresponds to one of these 128
codepoints is *also* taken as invalid, in order to guarantee
round-tripping; one should not have been storing PUA characters in
files in the first place *and* expect to be able to manipulate them
on every OS too. (Just round-tripping works, but e.g.
tr $'\uEF80' $'\uEF81' will not work.)

I had tentatively assigned a 128-codepoint PUA range for this, but
then contacted the ConScript Unicode Registry, which is a voluntary
agreement to reserve each other’s ranges, and asked for a
128-codepoint assignment there (and got one, and also registered
some other PUA users from Linux and Apple with them).

MirBSD now maps \x80 to (wchar_t)0xEF80, \x81 to (wchar_t)0xEF81,
etc. up to \xFF to (wchar_t)0xEFFF, when converting from multibyte
to wide characters; additionally, the octet sequences ranging from
\xEE\xBE\x80 to \xEE\xBF\xBF, which are (strictly speaking) valid
UTF-8, are mapped to L"\uEFEE\uEFBE\uEF80" – L"\uEFEE\uEFBF\uEFBF".

Python 3 later had the same problem, and solved it in the same way,
although using a different range. Their system uses “Unicode”
strings throughout, which is fancy for wchar_t strings (and probably
more portable than the C equivalent), but one needs to be able to,
for example, call the open(2) equivalent with arbitrary 8-bit
filenames (as POSIX filenames are octet strings except NUL and
slash). Their mapping is defined in PEP 383, and instead of using
the PUA they map to the “second” half of UTF-16 surrogates, arguïng
that surrogates never occur unpaired in valid Unicode sequences.
(I disagreed because this makes the encoding stateful again, which
was one major benefit of Unicode to not have. Martin decided not to
switch to the MirBSD PUA mapping; we agreed to disagree there.
Additionally, at the current point in time, MirBSD is still
(deliberately – to keep the implementation small and suckless)
“confined” to the Unicode BMP, i.e. does n̲o̲t̲ use UTF-16 anyway, so
we could not use their definition.)

The implementation mostly consists of one macro, two files, and a
type definition (which is inspired by Bruno Haible’s libutf8):

#define iswoctet(wc) (((wchar_t)(wc) & 0xFF80) == 0xEF80)

https://www.mirbsd.org/cvs.cgi/src/kern/c/optu8to16.c?rev=HEAD
https://www.mirbsd.org/cvs.cgi/src/kern/c/optu16to8.c?rev=HEAD

typedef struct {
	unsigned int count:2;
	unsigned int value:12;
} __attribute__((__packed__)) mbstate_t;

Implementation note:
│ typedef short unsigned int wchar_t;
│ typedef unsigned int wint_t;
(This is not a requirement.)

The files are equivalents of wcrtomb (optu16to8; actually, in MirBSD
these two functions are precisely the same) and mbrtowc (optu8to16,
but with a small API difference, so this is not a drop-in
replacement, for a reason), respectively. When calling optu8to16 one
byte at a time, it will store info in the mbstate_t argument just
like mbrtowc does, but upon encountering some nōn-UTF-8 sequences it
needs to be able to emit _several_ wide characters _without_ eating
up any octets; this makes for the API difference:

mbrtowc                         │optu8to16
════════════════════════════════╪═══════════════════════════════════════
-                               │When called with n == 0 it can still
                                │emit up to two wide characters, and
                                │will only thereafter return (size_t)-2.
────────────────────────────────┼───────────────────────────────────────
The s == NULL check is first.   │The check for n == 0 comes before the
                                │check for s == NULL so a caller can
                                │flush the state w/o needing to pass
                                │an input string.
────────────────────────────────┼───────────────────────────────────────
A return value of 0 means that  │Any return value not (size_t)-2 or -1
the most recent character was   │means “number of input octets eaten”.
a NUL character and terminates  │This implies a return value of 0 means
the string. Any other return    │“no input eaten but output was still
value not (size_t)-2 or -1 (or, │emitted”.
in the C11 case, (size_t)-3,    │
which does not work with 16-bit │
wchar_t though) means “number   │
of input octets eaten”.         │

When calling mbrtowc and/or optu8to16 with enough input octets to
always form output (be the input UTF-8 or raw octets), there will be
no difference; otherwise, calling mbrtowc() will lose the second or
third byte of multi-byte invalid input. This is the only known issue
with this solution, *and* it will require some code to be patched
(to either always pass “enough” bytes or use the optu8to16 function
if present, e.g. from an autoconf check).

“Enough” bytes has an upper bound: 3 octets of input are, in a̲l̲l̲
cases, enough to determine whether any given input is valid UTF-8 or
whether a PUA mapping needs to be emitted. (This is for 16-bit
Unicode; for 21-bit Unicode you would want to pass at least 4 octets
(and declare the 5- and 6-byte forms of UTF-8 invalid input).) In
most cases, 1 or 2 octets will be enough, but 3 will always work (in
the “invalid input” case, optu8to16 will just “eat” only one byte,
making a sliding 3-byte window).

The functions given are under a liberal enough licence (MirOS, like
MIT), but should this be a problem, talk to me directly. I still
reserve the right to “steer” the “OPTU-8” and “OPTU-16” “encodings”
(which is just a fancy way of saying “UTF-8/CESU-8 but raw octets
allowed” and “UCS-2 with specific meaning for the PUA area
[EF80;EFFF]”; nl_langinfo(CODESET) returns "UTF-8" on MirBSD of
course).

This all has been extensively, and in some cases empirically (i.e.
with all possible input and state values), tested, but only with the
constraint of supporting 16-bit Unicode.

(Note that the wchar_t values 0xFFFE and 0xFFFF are reserved for the
(size_t)-1 and (size_t)-2 results, and thus their corresponding
UTF-8 encoding is considered invalid; but then, e̲v̲e̲r̲y̲ system with a
16-bit wchar_t m̲u̲s̲t̲ do the same; those just would need to throw
EILSEQ instead of passing it through transparently.)

HTH & HAND,
//mirabilos
-- 
13:37⎜«Natureshadow» Deep inside, I hate mirabilos. I mean, he's a
good guy. But he's always right! In every fsckin' situation, he's
right. Even with his deeply perverted taste in software and borked
ambition towards broken OSes - in the end, he's damn right about it
:(! […] works in mksh