Here is a proposed patch to overcome the wchar_t limitation in the 'dfa' module.

Jim: The background is explained in
<https://www.gnu.org/software/gnulib/manual/html_node/Strings-and-Characters.html>.
The plan was laid out in
<https://lists.gnu.org/archive/html/bug-gnulib/2018-12/msg00118.html>
and
<https://lists.gnu.org/archive/html/bug-gnulib/2023-06/msg00102.html>.
Accordingly, the 'grep' code needs a minimal change (attached). I have
verified that this change does not cause test failures in 'grep' and in
'sed', on glibc systems, FreeBSD, and Solaris 11.4.

Arnold: I have added '#if GAWK' conditionals, knowing that gawk's build
system does not use gnulib-tool and that you therefore pull from gnulib
manually. This means the improvements will not land in gawk, since dfa in
gawk will continue to use wchar_t. Objections?


2023-07-01  Bruno Haible  <br...@clisp.org>

	dfa: Overcome wchar_t limitations.
	* lib/localeinfo.h: Include <uchar.h>. Add special definitions for GAWK.
	(case_folded_counterparts): Change array element type to char32_t.
	* lib/localeinfo.c: Include <uchar.h>. Add special definitions for GAWK.
	(is_using_utf8, init_localeinfo): Use mbrtoc32 instead of mbrtowc.
	(lonesome_lower): Change element type to 'unsigned short'.
	(case_folded_counterparts): Change array element type to char32_t. Use
	c32toupper instead of towupper. Use c32tolower instead of towlower.
	* lib/dfa.c: Include <uchar.h>. Add special definitions for GAWK.
	(struct mb_char_classes): Change element type of 'chars' to char32_t.
	(mbs_to_wchar): Use mbrtoc32 instead of mbrtowc.
	(setbit_wc): Change type of first argument to char32_t. Use c32tob
	instead of wctob.
	(parse_bracket_exp): Update.
	(lex): Use c32isprint instead of iswprint. Use c32isspace instead of
	iswspace. Use c32rtomb instead of a %lc directive.
	(addtok_wc): Use c32rtomb instead of wcrtomb.
	(atom): Update.
	* modules/dfa (Depends-on): Remove wctype-h. Add uchar, mbrtoc32,
	c32rtomb, c32tob, c32tolower, c32toupper, c32isprint, c32isspace.
	(Link): Add $(LIBUNISTRING) $(LIBC32CONV).
	* modules/dfa-tests (Makefile.am): Link test-dfa-match-aux with
	$(LIBUNISTRING) $(LIBC32CONV).
	* NEWS: Mention the change.

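
For readers who want the core of the change in isolation, here is a minimal sketch. It is not code from the patch. It assumes a libc that ships <uchar.h> (the gnulib 'uchar' module exists for platforms where that is not the case), and 'decode_one' is a hypothetical helper that only loosely follows mbs_to_wchar. It shows the '#if GAWK' fallback and one way to deal with the (size_t) -3 return value that mbrtoc32 has and mbrtowc does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

#if GAWK
/* ISO C 99 fallback, as in the patch: keep using the wchar_t API.  */
# define char32_t wchar_t
# define mbrtoc32 mbrtowc
#else
/* ISO C 11 API (plain libc here; the patch relies on gnulib's module).  */
# include <uchar.h>
#endif

/* Hypothetical helper, loosely modelled on mbs_to_wchar.
   Decode one character from the N >= 1 bytes starting at S.
   On success, store it in *PC, set *VALID to true and return the
   number of bytes consumed.  On an invalid or incomplete sequence,
   reset *PS, set *VALID to false and return 1, so that the caller
   can skip a single byte.  */
size_t
decode_one (char32_t *pc, bool *valid, char const *s, size_t n,
            mbstate_t *ps)
{
  assert (n >= 1);
  char32_t c;
  size_t nbytes = mbrtoc32 (&c, s, n, ps);
  if (0 < nbytes && nbytes < (size_t) -2)
    {
      *pc = c;
      *valid = true;
      /* (size_t) -3: the character came out of the conversion state;
         no new input bytes were consumed.  */
      return nbytes == (size_t) -3 ? 0 : nbytes;
    }
  if (nbytes == 0)
    {
      /* The input was a NUL byte; it counts as one consumed byte.  */
      *pc = 0;
      *valid = true;
      return 1;
    }
  /* (size_t) -1 (invalid) or (size_t) -2 (incomplete).  */
  memset (ps, 0, sizeof *ps);
  *valid = false;
  return 1;
}
```

Compiling with -DGAWK=1 selects the wchar_t branch, in which the (size_t) -3 case simply never triggers.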
From b0b542347103c0564e0cc168dc89d282f7e510c7 Mon Sep 17 00:00:00 2001
From: Bruno Haible <br...@clisp.org>
Date: Sat, 1 Jul 2023 14:50:51 +0200
Subject: [PATCH] dfa: Overcome wchar_t limitations.

* lib/localeinfo.h: Include <uchar.h>. Add special definitions for GAWK.
(case_folded_counterparts): Change array element type to char32_t.
* lib/localeinfo.c: Include <uchar.h>. Add special definitions for GAWK.
(is_using_utf8, init_localeinfo): Use mbrtoc32 instead of mbrtowc.
(lonesome_lower): Change element type to 'unsigned short'.
(case_folded_counterparts): Change array element type to char32_t. Use
c32toupper instead of towupper. Use c32tolower instead of towlower.
* lib/dfa.c: Include <uchar.h>. Add special definitions for GAWK.
(struct mb_char_classes): Change element type of 'chars' to char32_t.
(mbs_to_wchar): Use mbrtoc32 instead of mbrtowc.
(setbit_wc): Change type of first argument to char32_t. Use c32tob
instead of wctob.
(parse_bracket_exp): Update.
(lex): Use c32isprint instead of iswprint. Use c32isspace instead of
iswspace. Use c32rtomb instead of a %lc directive.
(addtok_wc): Use c32rtomb instead of wcrtomb.
(atom): Update.
* modules/dfa (Depends-on): Remove wctype-h. Add uchar, mbrtoc32,
c32rtomb, c32tob, c32tolower, c32toupper, c32isprint, c32isspace.
(Link): Add $(LIBUNISTRING) $(LIBC32CONV).
* modules/dfa-tests (Makefile.am): Link test-dfa-match-aux with
$(LIBUNISTRING) $(LIBC32CONV).
* NEWS: Mention the change.
---
 ChangeLog         | 27 ++++++++++++++++
 NEWS              |  4 +++
 lib/dfa.c         | 79 ++++++++++++++++++++++++++++++-----------------
 lib/localeinfo.c  | 34 +++++++++++++-------
 lib/localeinfo.h  | 13 ++++++--
 modules/dfa       | 15 ++++++++-
 modules/dfa-tests |  2 +-
 7 files changed, 129 insertions(+), 45 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 10bf606af7..1840acffd7 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,30 @@
+2023-07-01  Bruno Haible  <br...@clisp.org>
+
+	dfa: Overcome wchar_t limitations.
+	* lib/localeinfo.h: Include <uchar.h>. Add special definitions for GAWK.
+	(case_folded_counterparts): Change array element type to char32_t.
+	* lib/localeinfo.c: Include <uchar.h>. Add special definitions for GAWK.
+	(is_using_utf8, init_localeinfo): Use mbrtoc32 instead of mbrtowc.
+	(lonesome_lower): Change element type to 'unsigned short'.
+	(case_folded_counterparts): Change array element type to char32_t. Use
+	c32toupper instead of towupper. Use c32tolower instead of towlower.
+	* lib/dfa.c: Include <uchar.h>. Add special definitions for GAWK.
+	(struct mb_char_classes): Change element type of 'chars' to char32_t.
+	(mbs_to_wchar): Use mbrtoc32 instead of mbrtowc.
+	(setbit_wc): Change type of first argument to char32_t. Use c32tob
+	instead of wctob.
+	(parse_bracket_exp): Update.
+	(lex): Use c32isprint instead of iswprint. Use c32isspace instead of
+	iswspace. Use c32rtomb instead of a %lc directive.
+	(addtok_wc): Use c32rtomb instead of wcrtomb.
+	(atom): Update.
+	* modules/dfa (Depends-on): Remove wctype-h. Add uchar, mbrtoc32,
+	c32rtomb, c32tob, c32tolower, c32toupper, c32isprint, c32isspace.
+	(Link): Add $(LIBUNISTRING) $(LIBC32CONV).
+	* modules/dfa-tests (Makefile.am): Link test-dfa-match-aux with
+	$(LIBUNISTRING) $(LIBC32CONV).
+	* NEWS: Mention the change.
+
 2023-07-01  Bruno Haible  <br...@clisp.org>
 
 	doc: Update regarding stable branches.
diff --git a/NEWS b/NEWS
index cfc8fe113a..0c55a12356 100644
--- a/NEWS
+++ b/NEWS
@@ -74,6 +74,10 @@ User visible incompatible changes
 
 Date        Modules         Changes
 
+2023-07-01  dfa             The signature of the function
+                            case_folded_counterparts, declared in localeinfo.h,
+                            has changed.
+
 2023-06-10  javacomp-script These modules now compile the Java code with option
             javacomp        '-source 1.6' or higher. As a consequence, the
                             compiler may emit notes "... uses unchecked or
diff --git a/lib/dfa.c b/lib/dfa.c
index f1bab73059..0c50dcc956 100644
--- a/lib/dfa.c
+++ b/lib/dfa.c
@@ -35,6 +35,27 @@
 #include <stdlib.h>
 #include <limits.h>
 #include <string.h>
+#include <wchar.h>
+
+#include "xalloc.h"
+#include "localeinfo.h"
+
+#include "gettext.h"
+#define _(str) gettext (str)
+
+#if GAWK
+/* Use ISO C 99 API.  */
+# include <wctype.h>
+# define char32_t wchar_t
+# define mbrtoc32 mbrtowc
+# define c32rtomb wcrtomb
+# define c32tob wctob
+# define c32isprint iswprint
+# define c32isspace iswspace
+#else
+/* Use ISO C 11 + gnulib API.  */
+# include <uchar.h>
+#endif
 
 /* Pacify gcc -Wanalyzer-null-dereference in areas where GCC
    understandably cannot deduce that the input comes from a
@@ -55,15 +76,6 @@ c_isdigit (char c)
   return '0' <= c && c <= '9';
 }
 
-#include "gettext.h"
-#define _(str) gettext (str)
-
-#include <wchar.h>
-#include <wctype.h>
-
-#include "xalloc.h"
-#include "localeinfo.h"
-
 #ifndef FALLTHROUGH
 # if 201710L < __STDC_VERSION__
 #  define FALLTHROUGH [[__fallthrough__]]
@@ -300,8 +312,8 @@ enum
 
   RPAREN,                       /* RPAREN never appears in the parse tree.  */
 
-  WCHAR,                        /* Only returned by lex.  wctok contains
-                                   the wide character representation.  */
+  WCHAR,                        /* Only returned by lex.  wctok contains the
+                                   32-bit wide character representation.  */
 
   ANYCHAR,                      /* ANYCHAR is a terminal symbol that matches
                                    a valid multibyte (or single byte) character.
@@ -394,7 +406,7 @@ struct mb_char_classes
 {
   ptrdiff_t cset;
   bool invert;
-  wchar_t *chars;               /* Normal characters.  */
+  char32_t *chars;              /* Normal characters.  */
   idx_t nchars;
   idx_t nchars_alloc;
 };
@@ -438,7 +450,7 @@ struct lexer_state
   idx_t parens;                 /* Count of outstanding left parens.  */
   int minrep, maxrep;           /* Repeat counts for {m,n}.  */
 
-  /* Wide character representation of the current multibyte character,
+  /* 32-bit wide character representation of the current multibyte character,
      or WEOF if there was an encoding error.  Used only if
      MB_CUR_MAX > 1.  */
   wint_t wctok;
@@ -621,9 +633,9 @@ static void regexp (struct dfa *dfa);
    convert just a single byte, to WEOF.  Return the number of bytes
    converted.
 
-   This differs from mbrtowc (PWC, S, N, &D->mbs) as follows:
+   This differs from mbrtoc32 (PWC, S, N, &D->mbs) as follows:
 
-   * PWC points to wint_t, not to wchar_t.
+   * PWC points to wint_t, not to char32_t.
    * The last arg is a dfa *D instead of merely a multibyte conversion
      state D->mbs.
    * N is idx_t not size_t, and must be at least 1.
@@ -640,11 +652,13 @@ mbs_to_wchar (wint_t *pwc, char const *s, idx_t n, struct dfa *d)
 
   if (wc == WEOF)
     {
-      wchar_t wch;
-      size_t nbytes = mbrtowc (&wch, s, n, &d->mbs);
+      char32_t wch;
+      size_t nbytes = mbrtoc32 (&wch, s, n, &d->mbs);
       if (0 < nbytes && nbytes < (size_t) -2)
         {
           *pwc = wch;
+          if (nbytes == (size_t) -3)
+            nbytes = 0;
           return nbytes;
         }
       memset (&d->mbs, 0, sizeof d->mbs);
@@ -844,15 +858,15 @@ char_context (struct dfa const *dfa, unsigned char c)
   return CTX_NONE;
 }
 
-/* Set a bit in the charclass for the given wchar_t.  Do nothing if WC
+/* Set a bit in the charclass for the given char32_t.  Do nothing if WC
    is represented by a multi-byte sequence.  Even for MB_CUR_MAX == 1,
    this may happen when folding case in weird Turkish locales where
    dotless i/dotted I are not included in the chosen character set.
    Return whether a bit was set in the charclass.  */
 static bool
-setbit_wc (wint_t wc, charclass *c)
+setbit_wc (char32_t wc, charclass *c)
 {
-  int b = wctob (wc);
+  int b = c32tob (wc);
   if (b < 0)
     return false;
 
@@ -1122,7 +1136,7 @@ parse_bracket_exp (struct dfa *dfa)
             known_bracket_exp = false;
           else
             {
-              wchar_t folded[CASE_FOLDED_BUFSIZE + 1];
+              char32_t folded[CASE_FOLDED_BUFSIZE + 1];
               int n = (dfa->syntax.case_fold
                        ? case_folded_counterparts (wc, folded + 1) + 1
                        : 1);
@@ -1564,15 +1578,24 @@ lex (struct dfa *dfa)
             {
               char const *msg;
               char msgbuf[100];
-              if (!iswprint (dfa->lex.wctok))
+              if (!c32isprint (dfa->lex.wctok))
                 msg = _("stray \\ before unprintable character");
-              else if (iswspace (dfa->lex.wctok))
+              else if (c32isspace (dfa->lex.wctok))
                 msg = _("stray \\ before white space");
               else
                 {
-                  int n = snprintf (msgbuf, sizeof msgbuf,
-                                    _("stray \\ before %lc"), dfa->lex.wctok);
-                  msg = 0 <= n && n < sizeof msgbuf ? msgbuf : _("stray \\");
+                  char buf[MB_LEN_MAX + 1];
+                  mbstate_t s = { 0 };
+                  size_t stored_bytes = c32rtomb (buf, dfa->lex.wctok, &s);
+                  if (stored_bytes < (size_t) -1)
+                    {
+                      buf[stored_bytes] = '\0';
+                      int n = snprintf (msgbuf, sizeof msgbuf,
+                                        _("stray \\ before %s"), buf);
+                      msg = 0 <= n && n < sizeof msgbuf ? msgbuf : _("stray \\");
+                    }
+                  else
+                    msg = _("stray \\");
                 }
               dfawarn (msg);
             }
@@ -1700,7 +1723,7 @@ addtok_wc (struct dfa *dfa, wint_t wc)
 {
   unsigned char buf[MB_LEN_MAX];
   mbstate_t s = { 0 };
-  size_t stored_bytes = wcrtomb ((char *) buf, wc, &s);
+  size_t stored_bytes = c32rtomb ((char *) buf, wc, &s);
   int buflen;
 
   if (stored_bytes != (size_t) -1)
@@ -1905,7 +1928,7 @@ atom (struct dfa *dfa)
 
       if (dfa->syntax.case_fold)
         {
-          wchar_t folded[CASE_FOLDED_BUFSIZE];
+          char32_t folded[CASE_FOLDED_BUFSIZE];
           int n = case_folded_counterparts (dfa->lex.wctok, folded);
           for (int i = 0; i < n; i++)
             {
diff --git a/lib/localeinfo.c b/lib/localeinfo.c
index d0e63af656..16a17e4643 100644
--- a/lib/localeinfo.c
+++ b/lib/localeinfo.c
@@ -27,7 +27,17 @@
 #include <locale.h>
 #include <stdlib.h>
 #include <string.h>
-#include <wctype.h>
+#if GAWK
+/* Use ISO C 99 API.  */
+# include <wctype.h>
+# define char32_t wchar_t
+# define mbrtoc32 mbrtowc
+# define c32tolower towlower
+# define c32toupper towupper
+#else
+/* Use ISO C 11 + gnulib API.  */
+# include <uchar.h>
+#endif
 
 /* The sbclen implementation relies on this.  */
 static_assert (MB_LEN_MAX <= SCHAR_MAX);
@@ -37,9 +47,9 @@ static_assert (MB_LEN_MAX <= SCHAR_MAX);
 static bool
 is_using_utf8 (void)
 {
-  wchar_t wc;
+  char32_t wc;
   mbstate_t mbs = {0};
-  return mbrtowc (&wc, "\xc4\x80", 2, &mbs) == 2 && wc == 0x100;
+  return mbrtoc32 (&wc, "\xc4\x80", 2, &mbs) == 2 && wc == 0x100;
 }
 
 /* Return true if the locale is compatible enough with the C locale so
@@ -95,19 +105,19 @@ init_localeinfo (struct localeinfo *localeinfo)
       char c = i;
       unsigned char uc = i;
       mbstate_t s = {0};
-      wchar_t wc;
-      size_t len = mbrtowc (&wc, &c, 1, &s);
+      char32_t wc;
+      size_t len = mbrtoc32 (&wc, &c, 1, &s);
       localeinfo->sbclen[uc] = len <= 1 ? 1 : - (int) - len;
       localeinfo->sbctowc[uc] = len <= 1 ? wc : WEOF;
     }
 }
 
-/* The set of wchar_t values C such that there's a useful locale
+/* The set of char32_t values C such that there's a useful locale
    somewhere where C != towupper (C) && C != towlower (towupper (C)).
    For example, 0x00B5 (U+00B5 MICRO SIGN) is in this table, because
    towupper (0x00B5) == 0x039C (U+039C GREEK CAPITAL LETTER MU), and
    towlower (0x039C) == 0x03BC (U+03BC GREEK SMALL LETTER MU).  */
-static short const lonesome_lower[] =
+static unsigned short int const lonesome_lower[] =
   {
     0x00B5, 0x0131, 0x017F, 0x01C5, 0x01C8, 0x01CB, 0x01F2, 0x0345,
     0x03C2, 0x03D0, 0x03D1, 0x03D5, 0x03D6, 0x03F0, 0x03F1,
@@ -129,20 +139,20 @@ static_assert (1 + 1 + sizeof lonesome_lower / sizeof *lonesome_lower
    stored; this is zero if C is WEOF.  */
 
 int
-case_folded_counterparts (wint_t c, wchar_t folded[CASE_FOLDED_BUFSIZE])
+case_folded_counterparts (wint_t c, char32_t folded[CASE_FOLDED_BUFSIZE])
 {
   int i;
   int n = 0;
-  wint_t uc = towupper (c);
-  wint_t lc = towlower (uc);
+  wint_t uc = c32toupper (c);
+  wint_t lc = c32tolower (uc);
   if (uc != c)
     folded[n++] = uc;
-  if (lc != uc && lc != c && towupper (lc) == uc)
+  if (lc != uc && lc != c && c32toupper (lc) == uc)
     folded[n++] = lc;
   for (i = 0; i < sizeof lonesome_lower / sizeof *lonesome_lower; i++)
     {
       wint_t li = lonesome_lower[i];
-      if (li != lc && li != uc && li != c && towupper (li) == uc)
+      if (li != lc && li != uc && li != c && c32toupper (li) == uc)
         folded[n++] = li;
     }
   return n;
diff --git a/lib/localeinfo.h b/lib/localeinfo.h
index bd443ef491..383a93870c 100644
--- a/lib/localeinfo.h
+++ b/lib/localeinfo.h
@@ -21,6 +21,13 @@
 
 #include <limits.h>
 #include <wchar.h>
+#if GAWK
+/* Use ISO C 99 API.  */
+# define char32_t wchar_t
+#else
+/* Use ISO C 11 + gnulib API.  */
+# include <uchar.h>
+#endif
 
 struct localeinfo
 {
@@ -43,8 +50,8 @@ struct localeinfo
   signed char sbclen[UCHAR_MAX + 1];
 
   /* An array indexed by byte values B that contains the corresponding
-     wide character (if any) for B if sbclen[B] == 1.  WEOF means the
-     byte is not a valid single-byte character, i.e., sbclen[B] == -1
+     32-bit wide character (if any) for B if sbclen[B] == 1.  WEOF means
+     the byte is not a valid single-byte character, i.e., sbclen[B] == -1
      or -2.  */
   wint_t sbctowc[UCHAR_MAX + 1];
 };
@@ -56,4 +63,4 @@ extern void init_localeinfo (struct localeinfo *);
    itself.  This is a generous upper bound.  */
 enum { CASE_FOLDED_BUFSIZE = 32 };
 
-extern int case_folded_counterparts (wint_t, wchar_t[CASE_FOLDED_BUFSIZE]);
+extern int case_folded_counterparts (wint_t, char32_t[CASE_FOLDED_BUFSIZE]);
diff --git a/modules/dfa b/modules/dfa
index 793352e4f5..849f61151f 100644
--- a/modules/dfa
+++ b/modules/dfa
@@ -9,11 +9,18 @@ lib/localeinfo.h
 
 Depends-on:
 assert
+c32isprint
+c32isspace
+c32rtomb
+c32tob
+c32tolower
+c32toupper
 c99
 ctype
 flexmember
 idx
 locale
+mbrtoc32
 regex
 stdbool
 stddef
@@ -21,9 +28,13 @@ stdint
 stdio
 stdlib
 string
+uchar
+# The lonesome_lower array requires ISO C 23 semantics for char32_t.
+# But uchar-c23 has a global effect, therefore leave it to each package
+# to enable it.
+#uchar-c23
 verify
 wchar
-wctype-h
 xalloc
 xalloc-die
 
@@ -38,7 +49,9 @@ Include:
 "localeinfo.h"
 
 Link:
+$(LTLIBUNISTRING) when linking with libtool, $(LIBUNISTRING) otherwise
 $(MBRTOWC_LIB)
+$(LTLIBC32CONV) when linking with libtool, $(LIBC32CONV) otherwise
 
 License:
 GPL
diff --git a/modules/dfa-tests b/modules/dfa-tests
index 982d370171..b7c7c11d27 100644
--- a/modules/dfa-tests
+++ b/modules/dfa-tests
@@ -21,4 +21,4 @@ TESTS += \
 	test-dfa-match.sh
 
 check_PROGRAMS += test-dfa-match-aux
-test_dfa_match_aux_LDADD = $(LDADD) $(SETLOCALE_LIB) @LIBINTL@ $(MBRTOWC_LIB)
+test_dfa_match_aux_LDADD = $(LDADD) $(SETLOCALE_LIB) $(LIBUNISTRING) @LIBINTL@ $(MBRTOWC_LIB) $(LIBC32CONV)
-- 
2.34.1

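
The replacement of the %lc directive in lex() is the least mechanical part of the patch, so a standalone sketch may help. A %lc directive takes a wint_t, which is tied to wchar_t and therefore cannot carry every char32_t value on platforms where wchar_t is 16 bits wide. The sketch converts the character to a multibyte string with c32rtomb first and then formats it with %s. 'stray_backslash_msg' is a hypothetical name, the gettext wrapper _() is omitted, and the libc's own <uchar.h> is assumed rather than the gnulib module.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper, not part of dfa.c.
   Format a warning about a stray backslash before character C into
   MSGBUF (MSGSIZE bytes) and return the message to use.  Fall back to
   a generic message if C has no multibyte representation in the
   current locale or if the formatted text does not fit.  */
char const *
stray_backslash_msg (char32_t c, char *msgbuf, size_t msgsize)
{
  assert (msgsize > 0);

  /* One extra byte for the terminating NUL that c32rtomb does not add.  */
  char buf[MB_LEN_MAX + 1];
  mbstate_t s;
  memset (&s, 0, sizeof s);

  size_t stored_bytes = c32rtomb (buf, c, &s);
  if (stored_bytes == (size_t) -1)
    return "stray \\";
  buf[stored_bytes] = '\0';

  int n = snprintf (msgbuf, msgsize, "stray \\ before %s", buf);
  return 0 <= n && (size_t) n < msgsize ? msgbuf : "stray \\";
}
```

As in the patch, the generic "stray \" text is used whenever the conversion or the formatting fails, so the caller always gets a usable message.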
From ad684fceb753089ca98a6d208cfffaf7ce25fcc5 Mon Sep 17 00:00:00 2001
From: Bruno Haible <br...@clisp.org>
Date: Sat, 1 Jul 2023 15:49:09 +0200
Subject: [PATCH] grep: Update after gnulib changed.

* src/grep.c (setup_ok_fold, fgrep_icase_charlen): Change the element
type of the 'folded' array, to match the new signature of
case_folded_counterparts.
---
 src/grep.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/grep.c b/src/grep.c
index 491dd02..d04a388 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -2261,7 +2261,7 @@ setup_ok_fold (void)
         continue;
 
       int ok = 1;
-      wchar_t folded[CASE_FOLDED_BUFSIZE];
+      char32_t folded[CASE_FOLDED_BUFSIZE];
       for (int n = case_folded_counterparts (wi, folded); 0 <= --n; )
         {
           char buf[MB_LEN_MAX];
@@ -2301,7 +2301,7 @@ fgrep_icase_charlen (char const *pat, idx_t patlen, mbstate_t *mbs)
   /* PAT starts with a multibyte character.  Fcompile works if the
      character has no case folded counterparts and toupper translates
      none of its encoding's bytes.  */
-  wchar_t folded[CASE_FOLDED_BUFSIZE];
+  char32_t folded[CASE_FOLDED_BUFSIZE];
   if (case_folded_counterparts (wc, folded))
     return -1;
   for (idx_t i = wn; 0 < --i; )
-- 
2.34.1
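
For callers outside grep that need the same adjustment, the pattern can be sketched as follows. This is an illustration under stated assumptions, not code from grep or gnulib: 'demo_folded_counterparts' is a simplified stand-in that uses towupper/towlower from ISO C (the real case_folded_counterparts uses c32toupper/c32tolower and also consults lonesome_lower), and 'longest_counterpart_encoding' is a hypothetical caller. The only point is that the buffer is now declared as char32_t and its elements are converted with c32rtomb.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>
#include <wctype.h>

enum { DEMO_FOLDED_BUFSIZE = 32 };

/* Simplified stand-in for case_folded_counterparts (assumption: for
   illustration only; it ignores the lonesome_lower table).  Store the
   case counterparts of C in FOLDED and return how many were stored.  */
int
demo_folded_counterparts (wint_t c, char32_t folded[DEMO_FOLDED_BUFSIZE])
{
  int n = 0;
  wint_t uc = towupper (c);
  wint_t lc = towlower (uc);
  if (uc != c)
    folded[n++] = uc;
  if (lc != uc && lc != c && towupper (lc) == uc)
    folded[n++] = lc;
  return n;
}

/* Hypothetical caller.  Return the length in bytes of the longest
   multibyte encoding among the case counterparts of C, 0 if C has no
   counterparts, or -1 if a counterpart cannot be encoded in the
   current locale.  */
int
longest_counterpart_encoding (wint_t c)
{
  char32_t folded[DEMO_FOLDED_BUFSIZE];   /* was: wchar_t folded[...] */
  int longest = 0;
  for (int n = demo_folded_counterparts (c, folded); 0 <= --n; )
    {
      char buf[MB_LEN_MAX];
      mbstate_t s;
      memset (&s, 0, sizeof s);
      size_t len = c32rtomb (buf, folded[n], &s);   /* was: wcrtomb */
      if (len == (size_t) -1)
        return -1;
      assert (len <= MB_LEN_MAX);
      if ((size_t) longest < len)
        longest = (int) len;
    }
  return longest;
}
```

Code that must also build without gnulib can keep a single source by adding the same kind of '#if GAWK'-style macro fallback that the dfa patch uses.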