Re: From wchar_t to char32_t

Bruno Haible Sun, 02 Jul 2023 13:19:21 -0700

Paul Eggert wrote:
> On 2023-07-02 06:33, Bruno Haible wrote:
> > +                    else if (bytes == (size_t) -3)
> > +                      bytes = 0;
> 
> Why is this sort of thing needed?


I tried to explain it in
https://lists.gnu.org/archive/html/bug-gnulib/2023-06/msg00134.html .

Basically, since ISO C 23 says that mbrtoc32() can return (size_t) -3,
I want to write future-proof code by handling this case, even though
currently no implementation produces this return code, and I consider
it unlikely that any implementation ever will.
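
To make the pattern concrete, here is a minimal sketch of a loop that
handles every return value of mbrtoc32(), including (size_t) -3. The
name count_char32 and the recovery policy (one unit per offending byte)
are assumptions for illustration, not code from gnulib or diffutils.
The sketch also does not fetch a pending second char32_t if the buffer
ends right after the first one.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper: count the char32_t units that mbrtoc32()
   produces for buf[0..len).  An encoding error counts as one unit
   per offending byte; an incomplete trailing sequence counts as one
   unit.  */
static size_t
count_char32 (const char *buf, size_t len)
{
  mbstate_t state;
  memset (&state, 0, sizeof state);

  size_t count = 0;
  const char *p = buf;
  const char *end = buf + len;

  while (p < end)
    {
      char32_t c;
      size_t bytes = mbrtoc32 (&c, p, end - p, &state);

      if (bytes == (size_t) -1)
        {
          /* Encoding error: skip one byte and reset the state.  */
          bytes = 1;
          memset (&state, 0, sizeof state);
        }
      else if (bytes == (size_t) -2)
        {
          /* Incomplete character at the end of the buffer.  */
          bytes = end - p;
          memset (&state, 0, sizeof state);
        }
      else if (bytes == (size_t) -3)
        /* A further char32_t from input that was already consumed:
           no new bytes are consumed by this call.  */
        bytes = 0;
      else if (bytes == 0)
        /* Null character: it occupies one byte of input.  */
        bytes = 1;

      count++;
      p += bytes;
    }

  return count;
}
```

The (size_t) -3 branch is the 2 lines under discussion: without it,
p would be advanced by a huge bogus amount.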

> I thought that (size_t) -3 was 
> possible only after a low surrogate, which is possible when decoding 
> valid UTF-16 to Unicode

The mbrtoc16() function returns (size_t) -3 when it stores a low surrogate
(as second char16_t after the first one was a high surrogate), right.
But we don't use the mbrtoc16() function, and I don't plan to use it, ever.

The mbrtoc32() function MUST return (size_t) -1 and set errno = EILSEQ when
the input is a UTF-8 byte sequence whose value would be a surrogate.
See https://en.wikipedia.org/wiki/UTF-8#Invalid_sequences_and_error_handling
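
As a small sketch of what "whose value would be a surrogate" means at
the byte level: the code points U+D800..U+DFFF would be encoded as
0xED followed by a byte in the range 0xA0..0xBF, so a decoder can
reject the sequence as soon as it sees those two bytes. The function
name utf8_encodes_surrogate is hypothetical; this is not how glibc or
gnulib implement the check.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: return true if the UTF-8-like byte sequence
   s[0..n) starts with the encoding of a surrogate (U+D800..U+DFFF).
   Such sequences are invalid UTF-8.  */
static bool
utf8_encodes_surrogate (const unsigned char *s, size_t n)
{
  /* 0xED 0x80..0x9F encodes U+D000..U+D7FF (valid);
     0xED 0xA0..0xBF would encode U+D800..U+DFFF (invalid).  */
  return n >= 2 && s[0] == 0xED && s[1] >= 0xA0 && s[1] <= 0xBF;
}
```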

> but not when decoding valid UTF-8 to Unicode.

When decoding valid or invalid UTF-8 through mbrtoc32(), (size_t) -3
can never occur.

> When can we get (size_t) -3 in a real-world system?

It can/could occur if all of the following conditions are met:

  * The locale encoding is BIG5-HKSCS, e.g. on a glibc system the
    zh_HK.BIG5-HKSCS locale.

  * The input is one of the 4 characters in that encoding that map to
    a sequence of two Unicode characters:

       input         maps to
       -----         -------
     0x88 0x62    U+00CA U+0304
     0x88 0x64    U+00CA U+030C
     0x88 0xA3    U+00EA U+0304
     0x88 0xA5    U+00EA U+030C

  * glibc is changed so that, in this case, mbrtoc32() does not work
    identically to mbrtowc().

  * The other glibc bug that causes gnulib to override mbrtoc32 gets fixed:
    https://sourceware.org/bugzilla/show_bug.cgi?id=19932
    https://sourceware.org/bugzilla/show_bug.cgi?id=29511

I consider this unlikely. It is more likely that glibc's behaviour does not
change, or that the zh_HK.BIG5-HKSCS locale becomes unsupported.
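
For illustration, the four two-character mappings listed above can be
written as a lookup table. A converter that returns one char32_t per
call would have to deliver the second element of such a pair through
a separate call, which is the situation that (size_t) -3 describes.
The names hkscs_pair and hkscs_lookup_pair are hypothetical; this is
only a sketch of the data, not glibc's conversion code.

```c
#include <stddef.h>
#include <uchar.h>

/* One BIG5-HKSCS double-byte character that maps to a sequence of
   two Unicode characters.  */
struct hkscs_pair
{
  unsigned char in[2];
  char32_t out[2];
};

static const struct hkscs_pair hkscs_pairs[] =
  {
    { { 0x88, 0x62 }, { 0x00CA, 0x0304 } },
    { { 0x88, 0x64 }, { 0x00CA, 0x030C } },
    { { 0x88, 0xA3 }, { 0x00EA, 0x0304 } },
    { { 0x88, 0xA5 }, { 0x00EA, 0x030C } }
  };

/* Hypothetical helper: return the two Unicode characters for the
   byte pair (b1, b2), or NULL if it is not one of the four special
   cases.  */
static const char32_t *
hkscs_lookup_pair (unsigned char b1, unsigned char b2)
{
  size_t n = sizeof hkscs_pairs / sizeof hkscs_pairs[0];
  for (size_t i = 0; i < n; i++)
    if (hkscs_pairs[i].in[0] == b1 && hkscs_pairs[i].in[1] == b2)
      return hkscs_pairs[i].out;
  return NULL;
}
```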

> If (size_t) -3 is possible, I suppose I should change diffutils to take 
> this into account, as bleeding-edge diffutils/src/side.c treats (size_t) 
> -3 as meaning the next input byte is an encoding error, which is 
> obviously wrong.

If you want the diffutils code to be future-proof, yes.

> The simplest way to fix this would be for diffutils to 
> go back to using wchar_t,

?? We are talking about 2 lines of code, which lead to 2 instructions at
run time. If you want to micro-optimize execution time, you could
conditionally disable these 2 lines, until we know that the problem will
actually occur with glibc.

> although I don't know what the downsides of 
> that would be (diffutils doesn't care about Unicode; all it cares is 
> about is character classes and print widths).

With plain wchar_t (as opposed to char32_t), character classes and print widths
of non-BMP characters come out wrong on Cygwin, native Windows, and 32-bit AIX.
[1]

Bruno

[1] https://lists.gnu.org/archive/html/bug-gnulib/2023-06/msg00102.html
