Paul Eggert wrote:
> On 2023-07-02 06:33, Bruno Haible wrote:
> > +          else if (bytes == (size_t) -3)
> > +            bytes = 0;
>
> Why is this sort of thing needed?

I tried to explain it in
https://lists.gnu.org/archive/html/bug-gnulib/2023-06/msg00134.html .
Basically, since ISO C 23 says that mbrtoc32() can return (size_t) -3, I want
to write future-proof code by handling this case, even though currently no
implementation produces this return code, and I consider it unlikely that
any implementation ever will.

> I thought that (size_t) -3 was
> possible only after a low surrogate, which is possible when decoding
> valid UTF-16 to Unicode

The mbrtoc16() function returns (size_t) -3 when it stores a low surrogate
(as second char16_t after the first one was a high surrogate), right.

But we don't use the mbrtoc16() function, and I don't plan to use it, ever.

The mbrtoc32() function MUST return (size_t) -1 and set errno to EILSEQ when
the input is a UTF-8 byte sequence whose value would be a surrogate. See
https://en.wikipedia.org/wiki/UTF-8#Invalid_sequences_and_error_handling

> but not when decoding valid UTF-8 to Unicode.

When decoding valid or invalid UTF-8 through mbrtoc32(), (size_t) -3 can
never occur.

> When can we get (size_t) -3 in a real-world system?

It can/could occur if all of the following conditions are met:
  * The locale encoding is BIG5-HKSCS, e.g. on a glibc system the
    zh_HK.BIG5-HKSCS locale.
  * The input is one of the 4 characters in that encoding that map to
    a sequence of two Unicode characters:
       input          maps to
       -----          -------
       0x88 0x62      U+00CA U+0304
       0x88 0x64      U+00CA U+030C
       0x88 0xA3      U+00EA U+0304
       0x88 0xA5      U+00EA U+030C
  * glibc is changed so that, in this case, mbrtoc32() does not work
    identically to mbrtowc().
  * The other glibc bug that causes gnulib to override mbrtoc32 gets fixed:
       https://sourceware.org/bugzilla/show_bug.cgi?id=19932
       https://sourceware.org/bugzilla/show_bug.cgi?id=29511

I consider this unlikely. It is more likely that glibc's behaviour does not
change, or that the zh_HK.BIG5-HKSCS locale becomes unsupported.

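For illustration, here is a minimal sketch of a loop that handles every
mbrtoc32() return value, including (size_t) -3. The function name
count_char32 is hypothetical and appears in neither gnulib nor diffutils.
Skipping exactly one byte after an error and treating a decoded null
character as one byte long are assumptions made for this sketch only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper (not from gnulib or diffutils): count the char32_t
   values that mbrtoc32() produces for the N bytes at S in the current
   locale.  Each byte that starts an invalid or incomplete sequence is
   counted as one unit.  */
size_t
count_char32 (const char *s, size_t n)
{
  assert (s != NULL || n == 0);

  mbstate_t state;
  memset (&state, 0, sizeof state);
  size_t count = 0;

  while (n > 0)
    {
      char32_t c;
      size_t bytes = mbrtoc32 (&c, s, n, &state);

      if (bytes == (size_t) -1 || bytes == (size_t) -2)
        {
          /* Invalid or incomplete sequence.  Assumption of this sketch:
             skip one byte and restart from the initial state.  */
          bytes = 1;
          memset (&state, 0, sizeof state);
        }
      else if (bytes == (size_t) -3)
        /* A further char32_t produced from input that was already
           consumed by an earlier call; no bytes are consumed now.  */
        bytes = 0;
      else if (bytes == 0)
        /* A null character was decoded.  Assumption: it occupies one
           byte in the locale encoding.  */
        bytes = 1;

      count++;
      s += bytes;
      n -= bytes;
    }

  /* Limitation: a char32_t that is still pending when N reaches 0 is
     not fetched, because the loop stops once the input is exhausted.  */
  return count;
}
```

A caller would invoke setlocale (LC_ALL, "") first, so that mbrtoc32()
decodes according to the locale encoding, and then call for example
count_char32 (line, line_length).
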
> If (size_t) -3 is possible, I suppose I should change diffutils to take
> this into account, as bleeding-edge diffutils/src/side.c treats (size_t)
> -3 as meaning the next input byte is an encoding error, which is
> obviously wrong.

If you want the diffutils code to be future-proof, yes.

> The simplest way to fix this would be for diffutils to
> go back to using wchar_t,

?? We are talking about 2 lines of code, which lead to 2 instructions at
run time. If you want to micro-optimize execution time, you could
conditionally disable these 2 lines, until we know that the problem will
actually occur with glibc.

> although I don't know what the downsides of
> that would be (diffutils doesn't care about Unicode; all it cares is
> about is character classes and print widths).

With plain wchar_t (as opposed to char32_t), character classes and print
widths of non-BMP characters come out wrong on Cygwin, native Windows, and
32-bit AIX. [1]

Bruno

[1] https://lists.gnu.org/archive/html/bug-gnulib/2023-06/msg00102.html