Re: From wchar_t to char32_t

Bruno Haible Sat, 01 Jul 2023 07:35:52 -0700

Here is a proposed patch to overcome the wchar_t limitation in the 'dfa'
module.
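
For readers who have not run into the limitation: on platforms where
wchar_t is only 16 bits wide (native Windows, for instance), a character
outside the Basic Multilingual Plane does not fit into a single wchar_t,
whereas it always fits into one char32_t. The following is only a small
sketch of that arithmetic, not code from the patch; the helper names
utf16_units_needed and to_surrogates are made up for this illustration.

```c
#include <assert.h>
#include <uchar.h>

/* Hypothetical helper, not part of gnulib: return the number of 16-bit
   units needed to represent the Unicode code point C in UTF-16.  */
int
utf16_units_needed (char32_t c)
{
  assert (c <= 0x10FFFF);
  return c < 0x10000 ? 1 : 2;
}

/* Hypothetical helper, not part of gnulib: split a code point C outside
   the Basic Multilingual Plane into its UTF-16 surrogate pair.  */
void
to_surrogates (char32_t c, char16_t *hi, char16_t *lo)
{
  assert (0x10000 <= c && c <= 0x10FFFF);
  c -= 0x10000;
  *hi = 0xD800 + (c >> 10);     /* high (leading) surrogate */
  *lo = 0xDC00 + (c & 0x3FF);   /* low (trailing) surrogate */
}
```

With a 16-bit wchar_t, code that tests or case-folds one wide character
at a time sees only one half of such a pair; with char32_t the whole
character is available in one unit.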


Jim: The background is explained in
    <https://www.gnu.org/software/gnulib/manual/html_node/Strings-and-Characters.html>
The plan was laid out in
    <https://lists.gnu.org/archive/html/bug-gnulib/2018-12/msg00118.html>
and <https://lists.gnu.org/archive/html/bug-gnulib/2023-06/msg00102.html>

Accordingly, the 'grep' code needs a minimal change (attached).

I have verified that this change does not cause test failures in 'grep'
and in 'sed', on glibc systems, FreeBSD, and Solaris 11.4.

Arnold: I have added '#if GAWK' conditionals, since gawk's build system
does not use gnulib-tool and you therefore pull files from gnulib manually.
As a result, the improvements will not land in gawk: dfa in gawk will
continue to use wchar_t.

Objections?


2023-07-01  Bruno Haible  <br...@clisp.org>

        dfa: Overcome wchar_t limitations.
        * lib/localeinfo.h: Include <uchar.h>. Add special definitions for GAWK.
        (case_folded_counterparts): Change array element type to char32_t.
        * lib/localeinfo.c: Include <uchar.h>. Add special definitions for GAWK.
        (is_using_utf8, init_localeinfo): Use mbrtoc32 instead of mbrtowc.
        (lonesome_lower): Change element type to 'unsigned short'.
        (case_folded_counterparts): Change array element type to char32_t. Use
        c32toupper instead of towupper. Use c32tolower instead of towlower.
        * lib/dfa.c: Include <uchar.h>. Add special definitions for GAWK.
        (struct mb_char_classes): Change element type of 'chars' to char32_t.
        (mbs_to_wchar): Use mbrtoc32 instead of mbrtowc.
        (setbit_wc): Change type of first argument to char32_t. Use c32tob
        instead of wctob.
        (parse_bracket_exp): Update.
        (lex): Use c32isprint instead of iswprint. Use c32isspace instead of
        iswspace. Use c32rtomb instead of a %lc directive.
        (addtok_wc): Use c32rtomb instead of wcrtomb.
        (atom): Update.
        * modules/dfa (Depends-on): Remove wctype-h. Add uchar, mbrtoc32,
        c32rtomb, c32tob, c32tolower, c32toupper, c32isprint, c32isspace.
        (Link): Add $(LIBUNISTRING) $(LIBC32CONV).
        * modules/dfa-tests (Makefile.am): Link test-dfa-match-aux with
        $(LIBUNISTRING) $(LIBC32CONV).
        * NEWS: Mention the change.

From b0b542347103c0564e0cc168dc89d282f7e510c7 Mon Sep 17 00:00:00 2001
From: Bruno Haible <br...@clisp.org>
Date: Sat, 1 Jul 2023 14:50:51 +0200
Subject: [PATCH] dfa: Overcome wchar_t limitations.

* lib/localeinfo.h: Include <uchar.h>. Add special definitions for GAWK.
(case_folded_counterparts): Change array element type to char32_t.
* lib/localeinfo.c: Include <uchar.h>. Add special definitions for GAWK.
(is_using_utf8, init_localeinfo): Use mbrtoc32 instead of mbrtowc.
(lonesome_lower): Change element type to 'unsigned short'.
(case_folded_counterparts): Change array element type to char32_t. Use
c32toupper instead of towupper. Use c32tolower instead of towlower.
* lib/dfa.c: Include <uchar.h>. Add special definitions for GAWK.
(struct mb_char_classes): Change element type of 'chars' to char32_t.
(mbs_to_wchar): Use mbrtoc32 instead of mbrtowc.
(setbit_wc): Change type of first argument to char32_t. Use c32tob
instead of wctob.
(parse_bracket_exp): Update.
(lex): Use c32isprint instead of iswprint. Use c32isspace instead of
iswspace. Use c32rtomb instead of a %lc directive.
(addtok_wc): Use c32rtomb instead of wcrtomb.
(atom): Update.
* modules/dfa (Depends-on): Remove wctype-h. Add uchar, mbrtoc32,
c32rtomb, c32tob, c32tolower, c32toupper, c32isprint, c32isspace.
(Link): Add $(LIBUNISTRING) $(LIBC32CONV).
* modules/dfa-tests (Makefile.am): Link test-dfa-match-aux with
$(LIBUNISTRING) $(LIBC32CONV).
* NEWS: Mention the change.
---
 ChangeLog         | 27 ++++++++++++++++
 NEWS              |  4 +++
 lib/dfa.c         | 79 ++++++++++++++++++++++++++++++-----------------
 lib/localeinfo.c  | 34 +++++++++++++-------
 lib/localeinfo.h  | 13 ++++++--
 modules/dfa       | 15 ++++++++-
 modules/dfa-tests |  2 +-
 7 files changed, 129 insertions(+), 45 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 10bf606af7..1840acffd7 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,30 @@
+2023-07-01  Bruno Haible  <br...@clisp.org>
+
+	dfa: Overcome wchar_t limitations.
+	* lib/localeinfo.h: Include <uchar.h>. Add special definitions for GAWK.
+	(case_folded_counterparts): Change array element type to char32_t.
+	* lib/localeinfo.c: Include <uchar.h>. Add special definitions for GAWK.
+	(is_using_utf8, init_localeinfo): Use mbrtoc32 instead of mbrtowc.
+	(lonesome_lower): Change element type to 'unsigned short'.
+	(case_folded_counterparts): Change array element type to char32_t. Use
+	c32toupper instead of towupper. Use c32tolower instead of towlower.
+	* lib/dfa.c: Include <uchar.h>. Add special definitions for GAWK.
+	(struct mb_char_classes): Change element type of 'chars' to char32_t.
+	(mbs_to_wchar): Use mbrtoc32 instead of mbrtowc.
+	(setbit_wc): Change type of first argument to char32_t. Use c32tob
+	instead of wctob.
+	(parse_bracket_exp): Update.
+	(lex): Use c32isprint instead of iswprint. Use c32isspace instead of
+	iswspace. Use c32rtomb instead of a %lc directive.
+	(addtok_wc): Use c32rtomb instead of wcrtomb.
+	(atom): Update.
+	* modules/dfa (Depends-on): Remove wctype-h. Add uchar, mbrtoc32,
+	c32rtomb, c32tob, c32tolower, c32toupper, c32isprint, c32isspace.
+	(Link): Add $(LIBUNISTRING) $(LIBC32CONV).
+	* modules/dfa-tests (Makefile.am): Link test-dfa-match-aux with
+	$(LIBUNISTRING) $(LIBC32CONV).
+	* NEWS: Mention the change.
+
 2023-07-01  Bruno Haible  <br...@clisp.org>
 
 	doc: Update regarding stable branches.
diff --git a/NEWS b/NEWS
index cfc8fe113a..0c55a12356 100644
--- a/NEWS
+++ b/NEWS
@@ -74,6 +74,10 @@ User visible incompatible changes
 
 Date        Modules         Changes
 
+2023-07-01  dfa             The signature of the function
+                            case_folded_counterparts, declared in localeinfo.h,
+                            has changed.
+
 2023-06-10  javacomp-script  These modules now compile the Java code with option
             javacomp          '-source 1.6' or higher. As a consequence, the
                              compiler may emit notes "... uses unchecked or
diff --git a/lib/dfa.c b/lib/dfa.c
index f1bab73059..0c50dcc956 100644
--- a/lib/dfa.c
+++ b/lib/dfa.c
@@ -35,6 +35,27 @@
 #include <stdlib.h>
 #include <limits.h>
 #include <string.h>
+#include <wchar.h>
+
+#include "xalloc.h"
+#include "localeinfo.h"
+
+#include "gettext.h"
+#define _(str) gettext (str)
+
+#if GAWK
+/* Use ISO C 99 API.  */
+# include <wctype.h>
+# define char32_t wchar_t
+# define mbrtoc32 mbrtowc
+# define c32rtomb wcrtomb
+# define c32tob wctob
+# define c32isprint iswprint
+# define c32isspace iswspace
+#else
+/* Use ISO C 11 + gnulib API.  */
+# include <uchar.h>
+#endif
 
 /* Pacify gcc -Wanalyzer-null-dereference in areas where GCC
    understandably cannot deduce that the input comes from a
@@ -55,15 +76,6 @@ c_isdigit (char c)
   return '0' <= c && c <= '9';
 }
 
-#include "gettext.h"
-#define _(str) gettext (str)
-
-#include <wchar.h>
-#include <wctype.h>
-
-#include "xalloc.h"
-#include "localeinfo.h"
-
 #ifndef FALLTHROUGH
 # if 201710L < __STDC_VERSION__
 #  define FALLTHROUGH [[__fallthrough__]]
@@ -300,8 +312,8 @@ enum
 
   RPAREN,                       /* RPAREN never appears in the parse tree.  */
 
-  WCHAR,                        /* Only returned by lex.  wctok contains
-                                   the wide character representation.  */
+  WCHAR,                        /* Only returned by lex.  wctok contains the
+                                   32-bit wide character representation.  */
 
   ANYCHAR,                      /* ANYCHAR is a terminal symbol that matches
                                    a valid multibyte (or single byte) character.
@@ -394,7 +406,7 @@ struct mb_char_classes
 {
   ptrdiff_t cset;
   bool invert;
-  wchar_t *chars;               /* Normal characters.  */
+  char32_t *chars;              /* Normal characters.  */
   idx_t nchars;
   idx_t nchars_alloc;
 };
@@ -438,7 +450,7 @@ struct lexer_state
   idx_t parens;		/* Count of outstanding left parens.  */
   int minrep, maxrep;	/* Repeat counts for {m,n}.  */
 
-  /* Wide character representation of the current multibyte character,
+  /* 32-bit wide character representation of the current multibyte character,
      or WEOF if there was an encoding error.  Used only if
      MB_CUR_MAX > 1.  */
   wint_t wctok;
@@ -621,9 +633,9 @@ static void regexp (struct dfa *dfa);
    convert just a single byte, to WEOF.  Return the number of bytes
    converted.
 
-   This differs from mbrtowc (PWC, S, N, &D->mbs) as follows:
+   This differs from mbrtoc32 (PWC, S, N, &D->mbs) as follows:
 
-   * PWC points to wint_t, not to wchar_t.
+   * PWC points to wint_t, not to char32_t.
    * The last arg is a dfa *D instead of merely a multibyte conversion
      state D->mbs.
    * N is idx_t not size_t, and must be at least 1.
@@ -640,11 +652,13 @@ mbs_to_wchar (wint_t *pwc, char const *s, idx_t n, struct dfa *d)
 
   if (wc == WEOF)
     {
-      wchar_t wch;
-      size_t nbytes = mbrtowc (&wch, s, n, &d->mbs);
+      char32_t wch;
+      size_t nbytes = mbrtoc32 (&wch, s, n, &d->mbs);
       if (0 < nbytes && nbytes < (size_t) -2)
         {
           *pwc = wch;
+          if (nbytes == (size_t) -3)
+            nbytes = 0;
           return nbytes;
         }
       memset (&d->mbs, 0, sizeof d->mbs);
@@ -844,15 +858,15 @@ char_context (struct dfa const *dfa, unsigned char c)
   return CTX_NONE;
 }
 
-/* Set a bit in the charclass for the given wchar_t.  Do nothing if WC
+/* Set a bit in the charclass for the given char32_t.  Do nothing if WC
    is represented by a multi-byte sequence.  Even for MB_CUR_MAX == 1,
    this may happen when folding case in weird Turkish locales where
    dotless i/dotted I are not included in the chosen character set.
    Return whether a bit was set in the charclass.  */
 static bool
-setbit_wc (wint_t wc, charclass *c)
+setbit_wc (char32_t wc, charclass *c)
 {
-  int b = wctob (wc);
+  int b = c32tob (wc);
   if (b < 0)
     return false;
 
@@ -1122,7 +1136,7 @@ parse_bracket_exp (struct dfa *dfa)
         known_bracket_exp = false;
       else
         {
-          wchar_t folded[CASE_FOLDED_BUFSIZE + 1];
+          char32_t folded[CASE_FOLDED_BUFSIZE + 1];
           int n = (dfa->syntax.case_fold
                    ? case_folded_counterparts (wc, folded + 1) + 1
                    : 1);
@@ -1564,15 +1578,24 @@ lex (struct dfa *dfa)
             {
               char const *msg;
               char msgbuf[100];
-              if (!iswprint (dfa->lex.wctok))
+              if (!c32isprint (dfa->lex.wctok))
                 msg = _("stray \\ before unprintable character");
-              else if (iswspace (dfa->lex.wctok))
+              else if (c32isspace (dfa->lex.wctok))
                 msg = _("stray \\ before white space");
               else
                 {
-                  int n = snprintf (msgbuf, sizeof msgbuf,
-                                    _("stray \\ before %lc"), dfa->lex.wctok);
-                  msg = 0 <= n && n < sizeof msgbuf ? msgbuf : _("stray \\");
+                  char buf[MB_LEN_MAX + 1];
+                  mbstate_t s = { 0 };
+                  size_t stored_bytes = c32rtomb (buf, dfa->lex.wctok, &s);
+                  if (stored_bytes < (size_t) -1)
+                    {
+                      buf[stored_bytes] = '\0';
+                      int n = snprintf (msgbuf, sizeof msgbuf,
+                                        _("stray \\ before %s"), buf);
+                      msg = 0 <= n && n < sizeof msgbuf ? msgbuf : _("stray \\");
+                    }
+                  else
+                    msg = _("stray \\");
                 }
               dfawarn (msg);
             }
@@ -1700,7 +1723,7 @@ addtok_wc (struct dfa *dfa, wint_t wc)
 {
   unsigned char buf[MB_LEN_MAX];
   mbstate_t s = { 0 };
-  size_t stored_bytes = wcrtomb ((char *) buf, wc, &s);
+  size_t stored_bytes = c32rtomb ((char *) buf, wc, &s);
   int buflen;
 
   if (stored_bytes != (size_t) -1)
@@ -1905,7 +1928,7 @@ atom (struct dfa *dfa)
 
           if (dfa->syntax.case_fold)
             {
-              wchar_t folded[CASE_FOLDED_BUFSIZE];
+              char32_t folded[CASE_FOLDED_BUFSIZE];
               int n = case_folded_counterparts (dfa->lex.wctok, folded);
               for (int i = 0; i < n; i++)
                 {
diff --git a/lib/localeinfo.c b/lib/localeinfo.c
index d0e63af656..16a17e4643 100644
--- a/lib/localeinfo.c
+++ b/lib/localeinfo.c
@@ -27,7 +27,17 @@
 #include <locale.h>
 #include <stdlib.h>
 #include <string.h>
-#include <wctype.h>
+#if GAWK
+/* Use ISO C 99 API.  */
+# include <wctype.h>
+# define char32_t wchar_t
+# define mbrtoc32 mbrtowc
+# define c32tolower towlower
+# define c32toupper towupper
+#else
+/* Use ISO C 11 + gnulib API.  */
+# include <uchar.h>
+#endif
 
 /* The sbclen implementation relies on this.  */
 static_assert (MB_LEN_MAX <= SCHAR_MAX);
@@ -37,9 +47,9 @@ static_assert (MB_LEN_MAX <= SCHAR_MAX);
 static bool
 is_using_utf8 (void)
 {
-  wchar_t wc;
+  char32_t wc;
   mbstate_t mbs = {0};
-  return mbrtowc (&wc, "\xc4\x80", 2, &mbs) == 2 && wc == 0x100;
+  return mbrtoc32 (&wc, "\xc4\x80", 2, &mbs) == 2 && wc == 0x100;
 }
 
 /* Return true if the locale is compatible enough with the C locale so
@@ -95,19 +105,19 @@ init_localeinfo (struct localeinfo *localeinfo)
       char c = i;
       unsigned char uc = i;
       mbstate_t s = {0};
-      wchar_t wc;
-      size_t len = mbrtowc (&wc, &c, 1, &s);
+      char32_t wc;
+      size_t len = mbrtoc32 (&wc, &c, 1, &s);
       localeinfo->sbclen[uc] = len <= 1 ? 1 : - (int) - len;
       localeinfo->sbctowc[uc] = len <= 1 ? wc : WEOF;
     }
 }
 
-/* The set of wchar_t values C such that there's a useful locale
+/* The set of char32_t values C such that there's a useful locale
    somewhere where C != towupper (C) && C != towlower (towupper (C)).
    For example, 0x00B5 (U+00B5 MICRO SIGN) is in this table, because
    towupper (0x00B5) == 0x039C (U+039C GREEK CAPITAL LETTER MU), and
    towlower (0x039C) == 0x03BC (U+03BC GREEK SMALL LETTER MU).  */
-static short const lonesome_lower[] =
+static unsigned short int const lonesome_lower[] =
   {
     0x00B5, 0x0131, 0x017F, 0x01C5, 0x01C8, 0x01CB, 0x01F2, 0x0345,
     0x03C2, 0x03D0, 0x03D1, 0x03D5, 0x03D6, 0x03F0, 0x03F1,
@@ -129,20 +139,20 @@ static_assert (1 + 1 + sizeof lonesome_lower / sizeof *lonesome_lower
    stored; this is zero if C is WEOF.  */
 
 int
-case_folded_counterparts (wint_t c, wchar_t folded[CASE_FOLDED_BUFSIZE])
+case_folded_counterparts (wint_t c, char32_t folded[CASE_FOLDED_BUFSIZE])
 {
   int i;
   int n = 0;
-  wint_t uc = towupper (c);
-  wint_t lc = towlower (uc);
+  wint_t uc = c32toupper (c);
+  wint_t lc = c32tolower (uc);
   if (uc != c)
     folded[n++] = uc;
-  if (lc != uc && lc != c && towupper (lc) == uc)
+  if (lc != uc && lc != c && c32toupper (lc) == uc)
     folded[n++] = lc;
   for (i = 0; i < sizeof lonesome_lower / sizeof *lonesome_lower; i++)
     {
       wint_t li = lonesome_lower[i];
-      if (li != lc && li != uc && li != c && towupper (li) == uc)
+      if (li != lc && li != uc && li != c && c32toupper (li) == uc)
         folded[n++] = li;
     }
   return n;
diff --git a/lib/localeinfo.h b/lib/localeinfo.h
index bd443ef491..383a93870c 100644
--- a/lib/localeinfo.h
+++ b/lib/localeinfo.h
@@ -21,6 +21,13 @@
 
 #include <limits.h>
 #include <wchar.h>
+#if GAWK
+/* Use ISO C 99 API.  */
+# define char32_t wchar_t
+#else
+/* Use ISO C 11 + gnulib API.  */
+# include <uchar.h>
+#endif
 
 struct localeinfo
 {
@@ -43,8 +50,8 @@ struct localeinfo
   signed char sbclen[UCHAR_MAX + 1];
 
   /* An array indexed by byte values B that contains the corresponding
-     wide character (if any) for B if sbclen[B] == 1.  WEOF means the
-     byte is not a valid single-byte character, i.e., sbclen[B] == -1
+     32-bit wide character (if any) for B if sbclen[B] == 1.  WEOF means
+     the byte is not a valid single-byte character, i.e., sbclen[B] == -1
      or -2.  */
   wint_t sbctowc[UCHAR_MAX + 1];
 };
@@ -56,4 +63,4 @@ extern void init_localeinfo (struct localeinfo *);
    itself.  This is a generous upper bound.  */
 enum { CASE_FOLDED_BUFSIZE = 32 };
 
-extern int case_folded_counterparts (wint_t, wchar_t[CASE_FOLDED_BUFSIZE]);
+extern int case_folded_counterparts (wint_t, char32_t[CASE_FOLDED_BUFSIZE]);
diff --git a/modules/dfa b/modules/dfa
index 793352e4f5..849f61151f 100644
--- a/modules/dfa
+++ b/modules/dfa
@@ -9,11 +9,18 @@ lib/localeinfo.h
 
 Depends-on:
 assert
+c32isprint
+c32isspace
+c32rtomb
+c32tob
+c32tolower
+c32toupper
 c99
 ctype
 flexmember
 idx
 locale
+mbrtoc32
 regex
 stdbool
 stddef
@@ -21,9 +28,13 @@ stdint
 stdio
 stdlib
 string
+uchar
+# The lonesome_lower array requires ISO C 23 semantics for char32_t.
+# But uchar-c23 has a global effect, therefore leave it to each package
+# to enable it.
+#uchar-c23
 verify
 wchar
-wctype-h
 xalloc
 xalloc-die
 
@@ -38,7 +49,9 @@ Include:
 "localeinfo.h"
 
 Link:
+$(LTLIBUNISTRING) when linking with libtool, $(LIBUNISTRING) otherwise
 $(MBRTOWC_LIB)
+$(LTLIBC32CONV) when linking with libtool, $(LIBC32CONV) otherwise
 
 License:
 GPL
diff --git a/modules/dfa-tests b/modules/dfa-tests
index 982d370171..b7c7c11d27 100644
--- a/modules/dfa-tests
+++ b/modules/dfa-tests
@@ -21,4 +21,4 @@ TESTS += \
   test-dfa-match.sh
 
 check_PROGRAMS += test-dfa-match-aux
-test_dfa_match_aux_LDADD = $(LDADD) $(SETLOCALE_LIB) @LIBINTL@ $(MBRTOWC_LIB)
+test_dfa_match_aux_LDADD = $(LDADD) $(SETLOCALE_LIB) $(LIBUNISTRING) @LIBINTL@ $(MBRTOWC_LIB) $(LIBC32CONV)
-- 
2.34.1

From ad684fceb753089ca98a6d208cfffaf7ce25fcc5 Mon Sep 17 00:00:00 2001
From: Bruno Haible <br...@clisp.org>
Date: Sat, 1 Jul 2023 15:49:09 +0200
Subject: [PATCH] grep: Update after gnulib changed.

* src/grep.c (setup_ok_fold, fgrep_icase_charlen): Change the element type of
the 'folded' array, to match the new signature of case_folded_counterparts.
---
 src/grep.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/grep.c b/src/grep.c
index 491dd02..d04a388 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -2261,7 +2261,7 @@ setup_ok_fold (void)
         continue;
 
       int ok = 1;
-      wchar_t folded[CASE_FOLDED_BUFSIZE];
+      char32_t folded[CASE_FOLDED_BUFSIZE];
       for (int n = case_folded_counterparts (wi, folded); 0 <= --n; )
         {
           char buf[MB_LEN_MAX];
@@ -2301,7 +2301,7 @@ fgrep_icase_charlen (char const *pat, idx_t patlen, mbstate_t *mbs)
   /* PAT starts with a multibyte character.  Fcompile works if the
      character has no case folded counterparts and toupper translates
      none of its encoding's bytes.  */
-  wchar_t folded[CASE_FOLDED_BUFSIZE];
+  char32_t folded[CASE_FOLDED_BUFSIZE];
   if (case_folded_counterparts (wc, folded))
     return -1;
   for (idx_t i = wn; 0 < --i; )
-- 
2.34.1
