In the thread "From wchar_t to char32_t" we discussed, in particular, the mbrtoc32 function.
mbrtoc32, compared to mbrtowc, has two new features:
  (a) It overcomes wchar_t limitations, especially the fact that on
      Windows, wchar_t is only 16 bits wide.
  (b) It allows a multibyte sequence to be mapped to a sequence of
      char32_t characters, whereas mbrtowc maps a multibyte sequence
      to a single wchar_t (or returns an error).

With (a), we can satisfy
  Goal (A): Support non-BMP characters (such as Emojis) better on Windows,
            including Cygwin.

With (b), we could theoretically satisfy
  Goal (B): Support locales with BIG5-HKSCS encoding better.

However, (B) is a NON-GOAL.

1) Hardly anyone uses the BIG5-HKSCS encoding.

2) As we have found out, through the diffutils exercise and the 'dfa' module,
   supporting goal (B) means that
     * Applications need to distinguish places where it's OK to handle the
       several Unicode characters separately, such as in mbswidth, from
       places where the multibyte character has to be kept as a unit, and
       thus a wchar_t needs to be replaced not with a single char32_t but
       with a sequence of char32_t.
     * Accordingly, there is a need for two different modules 'mbchar':
       one that produces a single Unicode character at a time, and one
       that produces a sequence of Unicode characters.
     * Likewise for the modules 'mbiter' and 'mbuiter'.
   This is basically the sort of complexity that we did NOT want to add for
   supporting Windows with mbrtowc.

3) It's also a testability problem. Code that is not tested is buggy, in
   general. There is no glibc version so far that implements the mbrtoc32
   with BIG5-HKSCS encoding correctly; see
   <https://sourceware.org/bugzilla/show_bug.cgi?id=30611>.
   In order to test application code, we would have to write an alternate
   mbrtoc32 function which, for example, maps the 'ä' character to
   U+0041 U+0308. But this would be even more complexity, for the sake of
   a hypothetical scenario.
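To make feature (b) concrete, here is a minimal sketch of a decoding loop
that honors every return value ISO C allows for mbrtoc32, including
(size_t)-3. The function name count_c32_general is hypothetical; it is not
part of Gnulib, and the sketch assumes a platform whose <uchar.h> declares
mbrtoc32.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper (not a Gnulib function): count the char32_t units
   that mbrtoc32 produces for the NUL-terminated multibyte string S in the
   current locale.  Returns (size_t) -1 if S contains an invalid or
   incomplete multibyte sequence.  */
size_t
count_c32_general (const char *s)
{
  mbstate_t state;
  memset (&state, '\0', sizeof state);
  size_t remaining = strlen (s);
  size_t count = 0;

  for (;;)
    {
      char32_t c32;
      /* Offer the terminating NUL as well, so that the end of the string
         is reported through a return value of 0.  */
      size_t ret = mbrtoc32 (&c32, s, remaining + 1, &state);

      if (ret == 0)
        break;                  /* reached the terminating NUL */
      if (ret == (size_t) -1 || ret == (size_t) -2)
        return (size_t) -1;     /* invalid or incomplete sequence */
      if (ret == (size_t) -3)
        {
          /* Feature (b): a further char32_t of the previous multibyte
             character was stored; no input bytes were consumed.  */
          count++;
          continue;
        }
      /* Ordinary case: RET bytes were consumed, one char32_t was stored.  */
      count++;
      s += ret;
      remaining -= ret;
    }
  return count;
}
```

The (size_t)-3 branch is what point 2) is about: every loop of this kind
has to decide whether the extra char32_t belongs to the previous multibyte
character or may be treated as a character of its own.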
Paul seems to agree that this is a non-goal:
  - https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00021.html
    "We don't have time to support every oddball coding system that
     POSIX allows."
  - https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00026.html
    "And since it'll likely be a hassle to port the rest of the code to
     purely-theoretical platforms where nbytes == (size_t) -3, I suggest
     instead simply adding a comment that nbytes cannot be (size_t) -3
     there."
  - https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00032.html
    "you and I have already spent more time on theoretical platforms than
     they're likely worth"

Adding a comment would be a possibility. But we can do better by
formalizing the notion that we do NOT want (b).

DEFINITION: We call an mbrtoc32 function _regular_ if
  - It never returns (size_t)-3.
  - When it returns < (size_t)-2, the mbstate_t is in the initial state.

Here I'm adding a Gnulib module that provides a _regular_ mbrtoc32 function.
With a unit test. (Once we have formalized the notion, we can test it through
a unit test.)


2023-07-10  Bruno Haible  <br...@clisp.org>

	mbrtoc32-regular: Add tests.
	* tests/test-mbrtoc32-regular.c: New file.
	* modules/mbrtoc32-regular-tests: New file.

	mbrtoc32-regular: New module.
	* modules/mbrtoc32-regular: New file.
	* lib/mbrtoc32.c (mbrtoc32): If the module 'mbrtoc32-regular' is present
	and the system's mbrtoc32 returned a char32_t, clear the mbstate_t.
	* doc/posix-functions/mbrtoc32.texi: Mention the new module.
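For illustration, the DEFINITION of a _regular_ mbrtoc32 can be written down
as a predicate on the outcome of a single call. This is only a sketch: the
names outcome_is_regular and first_conversion_is_regular are hypothetical
and do not appear in the patches, and the code assumes a platform whose
<uchar.h> declares mbrtoc32.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical predicate (not part of Gnulib): does the outcome of one
   mbrtoc32 call, given by its return value RET and the resulting state
   *PS, conform to the definition of "regular"?  */
bool
outcome_is_regular (size_t ret, const mbstate_t *ps)
{
  /* Condition 1: it never returns (size_t)-3.  */
  if (ret == (size_t) -3)
    return false;
  /* Condition 2: when it returns < (size_t)-2, that is, when a char32_t
     was produced, the mbstate_t is in the initial state.  */
  if (ret < (size_t) -2 && !mbsinit (ps))
    return false;
  return true;
}

/* Hypothetical helper: convert the first multibyte character of the
   NUL-terminated string S, starting from the initial state, and check
   the outcome of that one call.  */
bool
first_conversion_is_regular (const char *s)
{
  mbstate_t state;
  memset (&state, '\0', sizeof state);
  char32_t c32;
  size_t ret = mbrtoc32 (&c32, s, strlen (s) + 1, &state);
  return outcome_is_regular (ret, &state);
}
```

A return value of (size_t)-1 or (size_t)-2 is not restricted by the
definition, which is why the predicate accepts both.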
From 0b55d1c3fbcb9bfa4b49a9aca16006294d118637 Mon Sep 17 00:00:00 2001
From: Bruno Haible <br...@clisp.org>
Date: Tue, 11 Jul 2023 00:03:34 +0200
Subject: [PATCH 1/2] mbrtoc32-regular: New module.

* modules/mbrtoc32-regular: New file.
* lib/mbrtoc32.c (mbrtoc32): If the module 'mbrtoc32-regular' is present
and the system's mbrtoc32 returned a char32_t, clear the mbstate_t.
* doc/posix-functions/mbrtoc32.texi: Mention the new module.
---
 ChangeLog                         |  8 ++++++++
 doc/posix-functions/mbrtoc32.texi | 24 +++++++++++++++---------
 lib/mbrtoc32.c                    |  9 +++++++++
 modules/mbrtoc32-regular          | 27 +++++++++++++++++++++++++++
 4 files changed, 59 insertions(+), 9 deletions(-)
 create mode 100644 modules/mbrtoc32-regular

diff --git a/ChangeLog b/ChangeLog
index fdc8e42ad4..c8dc122aa4 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,11 @@
+2023-07-10  Bruno Haible  <br...@clisp.org>
+
+	mbrtoc32-regular: New module.
+	* modules/mbrtoc32-regular: New file.
+	* lib/mbrtoc32.c (mbrtoc32): If the module 'mbrtoc32-regular' is present
+	and the system's mbrtoc32 returned a char32_t, clear the mbstate_t.
+	* doc/posix-functions/mbrtoc32.texi: Mention the new module.
+
 2023-07-10  Bruno Haible  <br...@clisp.org>
 
 	Apply the last change to all locale-*.m4 files.
diff --git a/doc/posix-functions/mbrtoc32.texi b/doc/posix-functions/mbrtoc32.texi
index 3528114bec..9690dd047d 100644
--- a/doc/posix-functions/mbrtoc32.texi
+++ b/doc/posix-functions/mbrtoc32.texi
@@ -2,9 +2,9 @@
 @section @code{mbrtoc32}
 @findex mbrtoc32
 
-Gnulib module: mbrtoc32
+Gnulib module: mbrtoc32 or mbrtoc32-regular
 
-Portability problems fixed by Gnulib:
+Portability problems fixed by either Gnulib module @code{mbrtoc32} or @code{mbrtoc32-regular}:
 @itemize
 @item
 This function is missing on most non-glibc platforms:
@@ -35,19 +35,25 @@
 @c See https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/mbrtoc16-mbrtoc323
 @end itemize
 
-Portability problems not fixed by Gnulib:
+Portability problems fixed by Gnulib module @code{mbrtoc32-regular}:
 @itemize
 @item
+This function can map some multibyte characters to a sequence of two or more
+Unicode characters, and may thus return @code{(size_t) -3}.
+No known implementation currently (2023) behaves that way, but it may
+theoretically happen.
+With the @code{mbrtoc32-regular} module, you have the guarantee that the
+Gnulib-provided @code{mbrtoc32} function maps each multibyte character to
+exactly one Unicode character and thus never returns @code{(size_t) -3}.
+@item
 This function behaves incorrectly when converting precomposed characters
 from the BIG5-HKSCS encoding:
 @c https://sourceware.org/bugzilla/show_bug.cgi?id=30611
 glibc 2.36.
-@item
-Although ISO C says this function can return @code{(size_t) -3},
-no known implementation behaves that way,
-and if it were to happen it would break common uses.
-If dealing with @code{(size_t) -3} would complicate your code significantly,
-it is probably better not to bother.
+@end itemize
+
+Portability problems not fixed by Gnulib:
+@itemize
 @item
 This function is only defined as an inline function on some platforms:
 Haiku 2020.
diff --git a/lib/mbrtoc32.c b/lib/mbrtoc32.c
index 6a56d93a4b..96039f9480 100644
--- a/lib/mbrtoc32.c
+++ b/lib/mbrtoc32.c
@@ -126,6 +126,15 @@ mbrtoc32 (char32_t *pwc, const char *s, size_t n, mbstate_t *ps)
   size_t ret = mbrtoc32 (pwc, s, n, ps);
 # endif
 
+# if GNULIB_MBRTOC32_REGULAR
+  /* Verify that mbrtoc32 is regular.  */
+  if (ret < (size_t) -3 && ! mbsinit (ps))
+    /* This occurs on glibc 2.36.  */
+    memset (ps, '\0', sizeof (mbstate_t));
+  if (ret == (size_t) -3)
+    abort ();
+# endif
+
 # if MBRTOC32_IN_C_LOCALE_MAYBE_EILSEQ
   if ((size_t) -2 <= ret && n != 0 && ! hard_locale (LC_CTYPE))
     {
diff --git a/modules/mbrtoc32-regular b/modules/mbrtoc32-regular
new file mode 100644
index 0000000000..e8ae236fc5
--- /dev/null
+++ b/modules/mbrtoc32-regular
@@ -0,0 +1,27 @@
+Description:
+mbrtoc32() function that maps each multibyte character to exactly one Unicode
+character and thus never returns (size_t)(-3).
+
+Files:
+
+Depends-on:
+mbrtoc32
+
+configure.ac:
+gl_MODULE_INDICATOR([mbrtoc32-regular])
+
+Makefile.am:
+
+Include:
+<uchar.h>
+
+Link:
+$(LTLIBUNISTRING) when linking with libtool, $(LIBUNISTRING) otherwise
+$(MBRTOWC_LIB)
+$(LTLIBC32CONV) when linking with libtool, $(LIBC32CONV) otherwise
+
+License:
+LGPLv2+
+
+Maintainer:
+Bruno Haible
-- 
2.34.1
From 2d46fcdd3fa38139f3c3b6cbc3439363553ee0e7 Mon Sep 17 00:00:00 2001
From: Bruno Haible <br...@clisp.org>
Date: Tue, 11 Jul 2023 00:06:14 +0200
Subject: [PATCH 2/2] mbrtoc32-regular: Add tests.

* tests/test-mbrtoc32-regular.c: New file.
* modules/mbrtoc32-regular-tests: New file.
---
 ChangeLog                      |  4 ++
 modules/mbrtoc32-regular-tests | 14 ++++++
 tests/test-mbrtoc32-regular.c  | 79 ++++++++++++++++++++++++++++++++++
 3 files changed, 97 insertions(+)
 create mode 100644 modules/mbrtoc32-regular-tests
 create mode 100644 tests/test-mbrtoc32-regular.c

diff --git a/ChangeLog b/ChangeLog
index c8dc122aa4..3eb2e2bc4b 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,9 @@
 2023-07-10  Bruno Haible  <br...@clisp.org>
 
+	mbrtoc32-regular: Add tests.
+	* tests/test-mbrtoc32-regular.c: New file.
+	* modules/mbrtoc32-regular-tests: New file.
+
 	mbrtoc32-regular: New module.
 	* modules/mbrtoc32-regular: New file.
 	* lib/mbrtoc32.c (mbrtoc32): If the module 'mbrtoc32-regular' is present
diff --git a/modules/mbrtoc32-regular-tests b/modules/mbrtoc32-regular-tests
new file mode 100644
index 0000000000..907f73721a
--- /dev/null
+++ b/modules/mbrtoc32-regular-tests
@@ -0,0 +1,14 @@
+Files:
+tests/test-mbrtoc32-regular.c
+tests/macros.h
+
+Depends-on:
+mbsinit
+setlocale
+
+configure.ac:
+
+Makefile.am:
+TESTS += test-mbrtoc32-regular
+check_PROGRAMS += test-mbrtoc32-regular
+test_mbrtoc32_regular_LDADD = $(LDADD) $(LIBUNISTRING) $(SETLOCALE_LIB) $(MBRTOWC_LIB) $(LIBC32CONV)
diff --git a/tests/test-mbrtoc32-regular.c b/tests/test-mbrtoc32-regular.c
new file mode 100644
index 0000000000..a85a0a5a69
--- /dev/null
+++ b/tests/test-mbrtoc32-regular.c
@@ -0,0 +1,79 @@
+/* Test of conversion of multibyte character to 32-bit wide character.
+   Copyright (C) 2023 Free Software Foundation, Inc.
+
+   This program is free software: you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation, either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see <https://www.gnu.org/licenses/>.  */
+
+/* Written by Bruno Haible <br...@clisp.org>, 2023.  */
+
+#include <config.h>
+
+#include <uchar.h>
+
+#include <locale.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <uchar.h>
+#include <wchar.h>
+
+#include "macros.h"
+
+int
+main (int argc, char *argv[])
+{
+  /* The only locales in which mbrtoc32 may map a multibyte character to a
+     sequence of two or more Unicode characters are those with BIG5-HKSCS
+     encoding.  See
+     <https://lists.gnu.org/archive/html/bug-gnulib/2023-06/msg00134.html>
+     <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00014.html>  */
+  if (setlocale (LC_ALL, "zh_HK.BIG5-HKSCS") == NULL)
+    {
+      fprintf (stderr, "Skipping test: found no locale with BIG5-HKSCS encoding.\n");
+      return 77;
+    }
+
+  /* The problematic BIG5-HKSCS characters are:
+
+       input        maps to         name
+       -----        -------         ----
+       0x88 0x62    U+00CA U+0304   LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND MACRON
+       0x88 0x64    U+00CA U+030C   LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND CARON
+       0x88 0xA3    U+00EA U+0304   LATIN SMALL LETTER E WITH CIRCUMFLEX AND MACRON
+       0x88 0xA5    U+00EA U+030C   LATIN SMALL LETTER E WITH CIRCUMFLEX AND CARON
+
+     Test one of them.
+     See <https://sourceware.org/bugzilla/show_bug.cgi?id=30611>.  */
+  mbstate_t state;
+  memset (&state, '\0', sizeof (mbstate_t));
+  char32_t c32 = (char32_t) 0xBADFACE;
+  size_t ret = mbrtoc32 (&c32, "\210\142", 2, &state);
+  /* It is OK if this conversion fails.  */
+  if (ret != (size_t)(-1))
+    {
+      /* mbrtoc32 being regular, means that STATE is in the initial state.  */
+      ASSERT (mbsinit (&state));
+      ret = mbrtoc32 (&c32, "", 0, &state);
+      /* mbrtoc32 being regular, means that it returns (size_t)(-2), not
+         (size_t)(-3), here.  */
+      ASSERT (ret == (size_t)(-2));
+      ret = mbrtoc32 (&c32, "", 1, &state);
+      /* mbrtoc32 being regular, means that it returns the null 32-bit wide
+         character, here, not any remnant from the previous multibyte
+         character.  */
+      ASSERT (ret == 0);
+      ASSERT (c32 == 0);
+    }
+
+  return 0;
+}
-- 
2.34.1