In the thread "From wchar_t to char32_t" we discussed the mbrtoc32 function
in particular.

mbrtoc32, compared to mbrtowc, has two new features:
  (a) it overcomes wchar_t limitations, especially the fact that on Windows,
      wchar_t is only 16 bits wide.
  (b) it allows a multibyte sequence to be mapped to a sequence of char32_t
      characters, whereas mbrtowc maps a multibyte sequence to a single
      wchar_t (or returns an error).
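
To make feature (b) concrete, here is a minimal sketch of a caller that copes
with all return values of mbrtoc32, including (size_t) -3. The function name
count_c32_units is hypothetical and not part of Gnulib; the sketch also assumes
an encoding in which the null character occupies a single byte.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper: counts the char32_t units that mbrtoc32 produces
   for the N bytes at S.  Returns (size_t) -1 if the input contains an
   invalid or incomplete multibyte sequence.  */
size_t
count_c32_units (const char *s, size_t n)
{
  mbstate_t state;
  char32_t c;
  size_t count = 0;

  memset (&state, '\0', sizeof state);
  while (n > 0)
    {
      size_t ret = mbrtoc32 (&c, s, n, &state);

      if (ret == (size_t) -1 || ret == (size_t) -2)
        return (size_t) -1;
      if (ret == (size_t) -3)
        {
          /* Feature (b): a further char32_t of the previous multibyte
             character was produced; no input was consumed.  */
          count++;
          continue;
        }
      if (ret == 0)
        ret = 1;   /* Assumption: the null character is one byte long.  */
      s += ret;
      n -= ret;
      count++;
    }
  /* Drain char32_t units that are still pending after the last byte.  */
  while (mbrtoc32 (&c, s, 0, &state) == (size_t) -3)
    count++;
  return count;
}
```

With mbrtowc, neither the (size_t) -3 branch nor the final drain loop would
exist; this is the extra case that every caller has to think about.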

With (a), we can satisfy
  Goal (A): Support non-BMP characters (such as Emojis) better on Windows,
            including Cygwin.

With (b), we could theoretically satisfy
  Goal (B): Support locales with BIG5-HKSCS encoding better.

However, (B) is a NON-GOAL.

1) Hardly anyone uses the BIG5-HKSCS encoding.

2) As we have found out, through the diffutils exercise and the 'dfa'
   module, supporting goal (B) means that

   * Applications need to distinguish places where it's OK to handle
     the several Unicode characters separately, such as in mbswidth,
     from places where the multibyte character has to be kept as a unit,
     and thus a wchar_t needs to be replaced not with a single char32_t
     but with a sequence of char32_t.

   * Accordingly, there is a need for two different modules 'mbchar':
     one that produces a single Unicode character at a time, and one
     that produces a sequence of Unicode characters.

   * Likewise for the modules 'mbiter' and 'mbuiter'.

   This is basically the sort of complexity that we did NOT want to add
   for supporting Windows with mbrtowc.

3) It's also a testability problem. Code that is not tested is buggy,
   in general. There is no glibc version so far that implements
   mbrtoc32 correctly for the BIG5-HKSCS encoding; see
   <https://sourceware.org/bugzilla/show_bug.cgi?id=30611>.
   In order to test application code, we would have to write an alternate
   mbrtoc32 function which, for example, maps the 'ä' character to
   U+0061 U+0308.
   But this would be even more complexity, for the sake of a hypothetical
   scenario.
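
For illustration only, such an alternate function might look roughly like the
sketch below. Everything in it is hypothetical: the names fake_mbrtoc32 and
struct fake_state, the choice of a single-byte encoding in which byte 0xE4 is
'ä' (as in ISO-8859-1), and the use of a separate state struct instead of the
opaque mbstate_t. It maps 'ä' to U+0061 U+0308, its canonical decomposition.

```c
#include <stddef.h>
#include <uchar.h>

/* Hypothetical state type.  A real replacement would have to keep this
   information inside the opaque mbstate_t.  */
struct fake_state
{
  char32_t pending;   /* 0 means: no char32_t is pending.  */
};

/* Hypothetical non-regular converter for a single-byte encoding.
   Byte 0xE4 is deliberately decomposed into two char32_t units.  */
size_t
fake_mbrtoc32 (char32_t *pc, const char *s, size_t n, struct fake_state *ps)
{
  if (ps->pending != 0)
    {
      /* Deliver the second unit without consuming any input.  */
      *pc = ps->pending;
      ps->pending = 0;
      return (size_t) -3;
    }
  if (n == 0)
    return (size_t) -2;

  unsigned char b = (unsigned char) s[0];
  if (b == 0xE4)
    {
      *pc = 0x0061;          /* LATIN SMALL LETTER A */
      ps->pending = 0x0308;  /* COMBINING DIAERESIS */
      return 1;
    }
  *pc = b;
  return (b == 0 ? 0 : 1);
}
```

Even this toy version needs its own state handling, and application tests
would additionally need a way to substitute it for the real mbrtoc32.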

Paul seems to agree that this is a non-goal:
  - https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00021.html
    "We don't have time to support every oddball coding system that POSIX
     allows."
  - https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00026.html
    "And since it'll likely be a hassle to port the rest of the code to
     purely-theoretical platforms where nbytes == (size_t) -3, I suggest
     instead simply adding a comment that nbytes cannot be (size_t) -3 there."
  - https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00032.html
    "you and I have already spent more time on theoretical platforms than
     they're likely worth"

Adding a comment would be a possibility. But we can do better by formalizing
the notion that we do NOT want (b).

DEFINITION: We call an mbrtoc32 function _regular_ if
  - It never returns (size_t)-3.
  - When it returns a value < (size_t)-2, that is, when it has converted a
    complete multibyte character, the mbstate_t is in the initial state.
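
The two conditions can be spelled out as a run-time check. The following is
only a sketch, with a hypothetical function name behaves_regularly; it
examines the behaviour of mbrtoc32 on one given input in the current locale
and therefore cannot prove regularity in general. Like the test in the patch,
it treats a failing conversion as acceptable.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical checker: converts the N bytes at S with mbrtoc32 and
   reports whether every call satisfied both conditions of the
   definition.  */
bool
behaves_regularly (const char *s, size_t n)
{
  mbstate_t state;

  memset (&state, '\0', sizeof state);
  while (n > 0)
    {
      char32_t c;
      size_t ret = mbrtoc32 (&c, s, n, &state);

      if (ret == (size_t) -3)
        return false;   /* Violates the first condition.  */
      if (ret == (size_t) -1 || ret == (size_t) -2)
        return true;    /* Invalid or incomplete input: nothing to check.  */
      if (!mbsinit (&state))
        return false;   /* Violates the second condition.  */
      if (ret == 0)
        ret = 1;        /* Assumption: the null character is one byte.  */
      s += ret;
      n -= ret;
    }
  return true;
}
```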

Here I'm adding a Gnulib module that provides a _regular_ mbrtoc32 function,
together with a unit test. (Once the notion is formalized, a unit test can
check it.)


2023-07-10  Bruno Haible  <br...@clisp.org>

        mbrtoc32-regular: Add tests.
        * tests/test-mbrtoc32-regular.c: New file.
        * modules/mbrtoc32-regular-tests: New file.

        mbrtoc32-regular: New module.
        * modules/mbrtoc32-regular: New file.
        * lib/mbrtoc32.c (mbrtoc32): If the module 'mbrtoc32-regular' is present
        and the system's mbrtoc32 returned a char32_t, clear the mbstate_t.
        * doc/posix-functions/mbrtoc32.texi: Mention the new module.

From 0b55d1c3fbcb9bfa4b49a9aca16006294d118637 Mon Sep 17 00:00:00 2001
From: Bruno Haible <br...@clisp.org>
Date: Tue, 11 Jul 2023 00:03:34 +0200
Subject: [PATCH 1/2] mbrtoc32-regular: New module.

* modules/mbrtoc32-regular: New file.
* lib/mbrtoc32.c (mbrtoc32): If the module 'mbrtoc32-regular' is present
and the system's mbrtoc32 returned a char32_t, clear the mbstate_t.
* doc/posix-functions/mbrtoc32.texi: Mention the new module.
---
 ChangeLog                         |  8 ++++++++
 doc/posix-functions/mbrtoc32.texi | 24 +++++++++++++++---------
 lib/mbrtoc32.c                    |  9 +++++++++
 modules/mbrtoc32-regular          | 27 +++++++++++++++++++++++++++
 4 files changed, 59 insertions(+), 9 deletions(-)
 create mode 100644 modules/mbrtoc32-regular

diff --git a/ChangeLog b/ChangeLog
index fdc8e42ad4..c8dc122aa4 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,11 @@
+2023-07-10  Bruno Haible  <br...@clisp.org>
+
+	mbrtoc32-regular: New module.
+	* modules/mbrtoc32-regular: New file.
+	* lib/mbrtoc32.c (mbrtoc32): If the module 'mbrtoc32-regular' is present
+	and the system's mbrtoc32 returned a char32_t, clear the mbstate_t.
+	* doc/posix-functions/mbrtoc32.texi: Mention the new module.
+
 2023-07-10  Bruno Haible  <br...@clisp.org>
 
 	Apply the last change to all locale-*.m4 files.
diff --git a/doc/posix-functions/mbrtoc32.texi b/doc/posix-functions/mbrtoc32.texi
index 3528114bec..9690dd047d 100644
--- a/doc/posix-functions/mbrtoc32.texi
+++ b/doc/posix-functions/mbrtoc32.texi
@@ -2,9 +2,9 @@
 @section @code{mbrtoc32}
 @findex mbrtoc32
 
-Gnulib module: mbrtoc32
+Gnulib module: mbrtoc32 or mbrtoc32-regular
 
-Portability problems fixed by Gnulib:
+Portability problems fixed by either Gnulib module @code{mbrtoc32} or @code{mbrtoc32-regular}:
 @itemize
 @item
 This function is missing on most non-glibc platforms:
@@ -35,19 +35,25 @@
 @c See https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/mbrtoc16-mbrtoc323
 @end itemize
 
-Portability problems not fixed by Gnulib:
+Portability problems fixed by Gnulib module @code{mbrtoc32-regular}:
 @itemize
 @item
+This function can map some multibyte characters to a sequence of two or more
+Unicode characters, and may thus return @code{(size_t) -3}.
+No known implementation currently (2023) behaves that way, but it may
+theoretically happen.
+With the @code{mbrtoc32-regular} module, you have the guarantee that the
+Gnulib-provided @code{mbrtoc32} function maps each multibyte character to
+exactly one Unicode character and thus never returns @code{(size_t) -3}.
+@item
 This function behaves incorrectly when converting precomposed characters
 from the BIG5-HKSCS encoding:
 @c https://sourceware.org/bugzilla/show_bug.cgi?id=30611
 glibc 2.36.
-@item
-Although ISO C says this function can return @code{(size_t) -3},
-no known implementation behaves that way,
-and if it were to happen it would break common uses.
-If dealing with @code{(size_t) -3} would complicate your code significantly,
-it is probably better not to bother.
+@end itemize
+
+Portability problems not fixed by Gnulib:
+@itemize
 @item
 This function is only defined as an inline function on some platforms:
 Haiku 2020.
diff --git a/lib/mbrtoc32.c b/lib/mbrtoc32.c
index 6a56d93a4b..96039f9480 100644
--- a/lib/mbrtoc32.c
+++ b/lib/mbrtoc32.c
@@ -126,6 +126,15 @@ mbrtoc32 (char32_t *pwc, const char *s, size_t n, mbstate_t *ps)
   size_t ret = mbrtoc32 (pwc, s, n, ps);
 #  endif
 
+#  if GNULIB_MBRTOC32_REGULAR
+  /* Verify that mbrtoc32 is regular.  */
+  if (ret < (size_t) -3 && ! mbsinit (ps))
+    /* This occurs on glibc 2.36.  */
+    memset (ps, '\0', sizeof (mbstate_t));
+  if (ret == (size_t) -3)
+    abort ();
+#  endif
+
 #  if MBRTOC32_IN_C_LOCALE_MAYBE_EILSEQ
   if ((size_t) -2 <= ret && n != 0 && ! hard_locale (LC_CTYPE))
     {
diff --git a/modules/mbrtoc32-regular b/modules/mbrtoc32-regular
new file mode 100644
index 0000000000..e8ae236fc5
--- /dev/null
+++ b/modules/mbrtoc32-regular
@@ -0,0 +1,27 @@
+Description:
+mbrtoc32() function that maps each multibyte character to exactly one Unicode
+character and thus never returns (size_t)(-3).
+
+Files:
+
+Depends-on:
+mbrtoc32
+
+configure.ac:
+gl_MODULE_INDICATOR([mbrtoc32-regular])
+
+Makefile.am:
+
+Include:
+<uchar.h>
+
+Link:
+$(LTLIBUNISTRING) when linking with libtool, $(LIBUNISTRING) otherwise
+$(MBRTOWC_LIB)
+$(LTLIBC32CONV) when linking with libtool, $(LIBC32CONV) otherwise
+
+License:
+LGPLv2+
+
+Maintainer:
+Bruno Haible
-- 
2.34.1

From 2d46fcdd3fa38139f3c3b6cbc3439363553ee0e7 Mon Sep 17 00:00:00 2001
From: Bruno Haible <br...@clisp.org>
Date: Tue, 11 Jul 2023 00:06:14 +0200
Subject: [PATCH 2/2] mbrtoc32-regular: Add tests.

* tests/test-mbrtoc32-regular.c: New file.
* modules/mbrtoc32-regular-tests: New file.
---
 ChangeLog                      |  4 ++
 modules/mbrtoc32-regular-tests | 14 ++++++
 tests/test-mbrtoc32-regular.c  | 79 ++++++++++++++++++++++++++++++++++
 3 files changed, 97 insertions(+)
 create mode 100644 modules/mbrtoc32-regular-tests
 create mode 100644 tests/test-mbrtoc32-regular.c

diff --git a/ChangeLog b/ChangeLog
index c8dc122aa4..3eb2e2bc4b 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,9 @@
 2023-07-10  Bruno Haible  <br...@clisp.org>
 
+	mbrtoc32-regular: Add tests.
+	* tests/test-mbrtoc32-regular.c: New file.
+	* modules/mbrtoc32-regular-tests: New file.
+
 	mbrtoc32-regular: New module.
 	* modules/mbrtoc32-regular: New file.
 	* lib/mbrtoc32.c (mbrtoc32): If the module 'mbrtoc32-regular' is present
diff --git a/modules/mbrtoc32-regular-tests b/modules/mbrtoc32-regular-tests
new file mode 100644
index 0000000000..907f73721a
--- /dev/null
+++ b/modules/mbrtoc32-regular-tests
@@ -0,0 +1,14 @@
+Files:
+tests/test-mbrtoc32-regular.c
+tests/macros.h
+
+Depends-on:
+mbsinit
+setlocale
+
+configure.ac:
+
+Makefile.am:
+TESTS += test-mbrtoc32-regular
+check_PROGRAMS += test-mbrtoc32-regular
+test_mbrtoc32_regular_LDADD = $(LDADD) $(LIBUNISTRING) $(SETLOCALE_LIB) $(MBRTOWC_LIB) $(LIBC32CONV)
diff --git a/tests/test-mbrtoc32-regular.c b/tests/test-mbrtoc32-regular.c
new file mode 100644
index 0000000000..a85a0a5a69
--- /dev/null
+++ b/tests/test-mbrtoc32-regular.c
@@ -0,0 +1,79 @@
+/* Test of conversion of multibyte character to 32-bit wide character.
+   Copyright (C) 2023 Free Software Foundation, Inc.
+
+   This program is free software: you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation, either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see <https://www.gnu.org/licenses/>.  */
+
+/* Written by Bruno Haible <br...@clisp.org>, 2023.  */
+
+#include <config.h>
+
+#include <uchar.h>
+
+#include <locale.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <uchar.h>
+#include <wchar.h>
+
+#include "macros.h"
+
+int
+main (int argc, char *argv[])
+{
+  /* The only locales in which mbrtoc32 may map a multibyte character to a
+     sequence of two or more Unicode characters are those with BIG5-HKSCS
+     encoding.  See
+     <https://lists.gnu.org/archive/html/bug-gnulib/2023-06/msg00134.html>
+     <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00014.html>  */
+  if (setlocale (LC_ALL, "zh_HK.BIG5-HKSCS") == NULL)
+    {
+      fprintf (stderr, "Skipping test: found no locale with BIG5-HKSCS encoding.\n");
+      return 77;
+    }
+
+  /* The problematic BIG5-HKSCS characters are:
+
+       input         maps to                          name
+       -----         -------                          ----
+     0x88 0x62    U+00CA U+0304    LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND MACRON
+     0x88 0x64    U+00CA U+030C    LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND CARON
+     0x88 0xA3    U+00EA U+0304    LATIN SMALL LETTER E WITH CIRCUMFLEX AND MACRON
+     0x88 0xA5    U+00EA U+030C    LATIN SMALL LETTER E WITH CIRCUMFLEX AND CARON
+
+     Test one of them.
+     See <https://sourceware.org/bugzilla/show_bug.cgi?id=30611>.  */
+  mbstate_t state;
+  memset (&state, '\0', sizeof (mbstate_t));
+  char32_t c32 = (char32_t) 0xBADFACE;
+  size_t ret = mbrtoc32 (&c32, "\210\142", 2, &state);
+  /* It is OK if this conversion fails.  */
+  if (ret != (size_t)(-1))
+    {
+      /* mbrtoc32 being regular, means that STATE is in the initial state.  */
+      ASSERT (mbsinit (&state));
+      ret = mbrtoc32 (&c32, "", 0, &state);
+      /* mbrtoc32 being regular, means that it returns (size_t)(-2), not
+         (size_t)(-3), here.  */
+      ASSERT (ret == (size_t)(-2));
+      ret = mbrtoc32 (&c32, "", 1, &state);
+      /* mbrtoc32 being regular, means that it returns the null 32-bit wide
+         character, here, not any remnant from the previous multibyte
+         character.  */
+      ASSERT (ret == 0);
+      ASSERT (c32 == 0);
+    }
+
+  return 0;
+}
-- 
2.34.1