Re: From wchar_t to char32_t, new module mbszero

2023-07-19 Thread Bruno Haible
Paul Eggert wrote: > > On NetBSD, I apparently did not locate the right source code of the mbsinit > > function, due to the complexity of the citrus code. And did not want to > > debug > > it, because debugging in libc code without debugging information is often > > a waste of time. > > I looked

Re: From wchar_t to char32_t, new module mbszero

2023-07-18 Thread Paul Eggert
On 2023-07-17 09:53, Bruno Haible wrote: On NetBSD, I apparently did not locate the right source code of the mbsinit function, due to the complexity of the citrus code. And did not want to debug it, because debugging in libc code without debugging information is often a waste of time. I looked

Re: From wchar_t to char32_t, new module mbszero

2023-07-17 Thread Bruno Haible
Running the test suite on Minix shows test failures. There, like on NetBSD, the source code investigation was incomplete. This patch fixes the failures. 2023-07-17 Bruno Haible mbszero: Fix for Minix. * lib/wchar.in.h: (_GL_MBSTATE_INIT_SIZE): Don't define on Minix. (_

Re: From wchar_t to char32_t, new module mbszero

2023-07-17 Thread Bruno Haible
Paul Eggert wrote: > > However, after implementing mbszero with this data and enabling its use > > in many places, I got test failures on NetBSD and Solaris. > >- On NetBSD, the minimum we need to clear is 28 bytes. > >- On Solaris OmniOS and OpenIndiana, the minimum we need to clear is 16
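
Editorial sketch of the idea discussed in this sub-thread: resetting an mbstate_t by clearing only a platform-specific prefix of the object instead of all of it. The names SKETCH_MBSTATE_INIT_SIZE and sketch_mbszero are hypothetical and are not the gnulib definitions. The default below conservatively clears the whole object, and any smaller value is an assumption that would have to be verified per platform, as the test failures reported above show.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical knob: number of leading bytes that must be zero for the
   state to count as initial.  Defaults to the whole object, which is
   always safe; a smaller per-platform value is an assumption.  */
#ifndef SKETCH_MBSTATE_INIT_SIZE
# define SKETCH_MBSTATE_INIT_SIZE sizeof (mbstate_t)
#endif

/* Hypothetical helper: put *ps into the initial conversion state.  */
static inline void
sketch_mbszero (mbstate_t *ps)
{
  memset (ps, 0, SKETCH_MBSTATE_INIT_SIZE);
}
```

After a call to sketch_mbszero (&state), mbsinit (&state) should report an initial state. The point of the thread is that a too-small prefix can leave mbsinit() satisfied while the conversion functions still read stale bytes.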

Re: From wchar_t to char32_t, new module mbszero

2023-07-16 Thread Paul Eggert
On 2023-07-16 01:43, Bruno Haible wrote: Paul Eggert wrote: However, after implementing mbszero with this data and enabling its use in many places, I got test failures on NetBSD and Solaris. - On NetBSD, the minimum we need to clear is 28 bytes. - On Solaris OmniOS and OpenIndiana, the m

Re: From wchar_t to char32_t, new module mbszero

2023-07-16 Thread Bruno Haible
Paul Eggert wrote: > > By reading the source code of FreeBSD, NetBSD, OpenBSD, macOS, Solaris, > > and so on, I can easily determine > >- which parts of the mbstate_t mbsinit() tests, > >- which parts of the mbstate_t the various functions use. > > But in order to understand what interdepen

Re: From wchar_t to char32_t

2023-07-13 Thread Paul Eggert
On 2023-07-13 08:14, Bruno Haible wrote: By reading the source code of FreeBSD, NetBSD, OpenBSD, macOS, Solaris, and so on, I can easily determine - which parts of the mbstate_t mbsinit() tests, - which parts of the mbstate_t the various functions use. But in order to understand what inter

Re: From wchar_t to char32_t

2023-07-13 Thread Bruno Haible
Paul Eggert wrote: > > Based on the comments in gnulib/lib/mbrtoc16.c, I think it should better > > clear the first 24, not 12, bytes of the struct. Otherwise it can be in > > a state where mbsinit() returns true but the mbrto* functions have > > undefined behaviour. > > For mbcel all that mat

Re: From wchar_t to char32_t

2023-07-13 Thread Bruno Haible
I wrote: > 7.32.3.2 > towctrans -- rarely used > wctrans -- rarely used It's not hard to implement replacements for these two functions either. So that we get this correspondence: wchar_t char32_t --- t

Re: From wchar_t to char32_t

2023-07-12 Thread Bruno Haible
I wrote: > 7.32.2.2 > iswctype -- rarely used > wctype -- rarely used Well, 'fnmatch' and 'regex' use iswctype and wctype. So, we need to have counterparts of these functions in the char32_t world. Since the ISO C names are idiosyncratic, let

Re: From wchar_t to char32_t

2023-07-11 Thread Paul Eggert
On 7/11/23 15:32, Bruno Haible wrote: You are looking at GB18030. GB18030 and BIG5-HKSCS are completely unrelated. Ouch! Thanks for explaining.

Re: From wchar_t to char32_t

2023-07-11 Thread Bruno Haible
Paul Eggert wrote: > >* The locale encoding is BIG5-HKSCS, e.g. on a glibc system the > > zh_HK.BIG5-HKSCS locale. > > > >* The input is one of the 4 characters in that encoding that map to > > a sequence of two Unicode characters: > > > > input maps to > >

Re: From wchar_t to char32_t

2023-07-11 Thread Paul Eggert
On 7/2/23 13:18, Bruno Haible wrote: Paul Eggert wrote: When can we get (size_t) -3 in a real-world system? It can/could occur if all of the following conditions are met: * The locale encoding is BIG5-HKSCS, e.g. on a glibc system the zh_HK.BIG5-HKSCS locale. * The input is on

Re: From wchar_t to char32_t

2023-07-11 Thread Paul Eggert
On 2023-07-11 01:24, Bruno Haible wrote: Based on the comments in gnulib/lib/mbrtoc16.c, I think it should better clear the first 24, not 12, bytes of the struct. Otherwise it can be in a state where mbsinit() returns true but the mbrto* functions have undefined behaviour. For mbcel all tha

Re: From wchar_t to char32_t

2023-07-11 Thread Bruno Haible
Paul Eggert wrote: > We can improve on that. I installed the attached two performance tweaks; > the second tweak cuts that initialization from 128 down to at most 12 > bytes on those platforms. Based on the comments in gnulib/lib/mbrtoc16.c, I think it should better clear the first 24, not 12, b

Re: From wchar_t to char32_t

2023-07-11 Thread Paul Eggert
On 2023-07-10 07:58, Bruno Haible wrote: - The rationale for defining and initializing the mbstate_t at the function scope was that on BSD and macOS systems, an mbstate_t is 128 bytes large, We can improve on that. I installed the attached two performance tweaks; the second tweak cuts

Re: From wchar_t to char32_t

2023-07-10 Thread Bruno Haible
Paul Eggert wrote: > > - wchar_t wch; > > - size_t nbytes = mbrtowc (&wch, s, n, &d->mbs); > > + char32_t wch; > > + size_t nbytes = mbrtoc32 (&wch, s, n, &d->mbs); > >if (0 < nbytes && nbytes < (size_t) -2) > > { > >*pwc = wch; > > + if (nb

Re: From wchar_t to char32_t

2023-07-10 Thread Bruno Haible
Regarding my proposed 'dfa' module patch: Paul Eggert wrote on 2023-07-04: > > - wchar_t wch; > > - size_t nbytes = mbrtowc (&wch, s, n, &d->mbs); > > + char32_t wch; > > + size_t nbytes = mbrtoc32 (&wch, s, n, &d->mbs); > >if (0 < nbytes && nbytes < (size_t) -2) > >

Re: From wchar_t to char32_t

2023-07-10 Thread Bruno Haible
Paul Eggert wrote on 2023-07-06: > in reviewing it found a minor > glitch or two and some opportunities for simplification. I installed the > attached further patch which I hope fixes glitches without breaking > anything else. Comments: - Typo: s/mbrtoc23/mbrtoc32/ - The rationale for def

Re: From wchar_t to char32_t

2023-07-06 Thread Paul Eggert
On 2023-07-06 11:34, Bruno Haible wrote: Indeed, this is the solution that makes no assumptions. Find a patch that does it. Thanks, I think. I installed that, and in reviewing it found a minor glitch or two and some opportunities for simplification. I installed the attached further patch whi

Re: From wchar_t to char32_t

2023-07-06 Thread Bruno Haible
Paul Eggert wrote: > I still see a couple of problems with it. First, it mishandles the case > where mbrtoc32 returns 0, which ISO C allows. I thought that we could assume that no locale encoding maps a multibyte sequence other than "\0" to (char32_t) 0. But OK, if you don't want to assume that,

Re: From wchar_t to char32_t

2023-07-04 Thread Paul Eggert
On 2023-07-01 07:35, Bruno Haible wrote: - wchar_t wch; - size_t nbytes = mbrtowc (&wch, s, n, &d->mbs); + char32_t wch; + size_t nbytes = mbrtoc32 (&wch, s, n, &d->mbs); if (0 < nbytes && nbytes < (size_t) -2) { *pwc = wch; + if (nbytes =

Re: From wchar_t to char32_t

2023-07-04 Thread Jim Meyering
On Sat, Jul 1, 2023 at 7:35 AM Bruno Haible wrote: > > Here is a proposed patch to overcome the wchar_t limitation in the 'dfa' > module. > > Jim: The background is explained in > > The plan was exposed in >

Re: From wchar_t to char32_t

2023-07-04 Thread Paul Eggert
On 2023-07-04 12:31, Bruno Haible wrote: Yes. As far as I can see, this proposed patch should cope with (size_t) -3 returns correctly. I still see a couple of problems with it. First, it mishandles the case where mbrtoc32 returns 0, which ISO C allows. Second and more interestingly, its "fw

Re: From wchar_t to char32_t

2023-07-04 Thread Bruno Haible
[CCing diffutils-devel.] Paul Eggert wrote in : > >Level 3: Behave correctly. Don't split a 2-Unicode-character sequence. > > This is what code that uses mbrtoc32() does, when it has the > > lines > >

Re: From wchar_t to char32_t

2023-07-04 Thread Bruno Haible
I wrote: > Level 2: Behave correctly, except that a 2-Unicode-character sequence >may be split although it shouldn't. >This is what code that uses mbrtoc32() does, when it has the >lines > if (bytes == (size_t) -3) > bytes = 0;

Re: From wchar_t to char32_t

2023-07-03 Thread Paul Eggert
On 2023-07-03 15:00, Bruno Haible wrote: Level 3: Behave correctly. Don't split a 2-Unicode-character sequence. This is what code that uses mbrtoc32() does, when it has the lines if (bytes == (size_t) -3) bytes = 0; and us

Re: From wchar_t to char32_t

2023-07-03 Thread Bruno Haible
Paul Eggert wrote: > The complication would be needed because diffutils is trying to count > columns as it goes, and in some cases it needs to stop when a column > count has reached a maximum. It's not two lines of code. Indeed. I need to check the mbiter and mbuiter modules, since they do somet

Re: From wchar_t to char32_t

2023-07-03 Thread Paul Eggert
Come to think of it this (size_t) -3 issue with mbrtoc32 is probably worth documenting. I installed the attached to give it a shot. From e046d5458353f112e78893ca03d855c8a9aa2e39 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Mon, 3 Jul 2023 10:24:05 -0700 Subject: [PATCH] mbrtoc32: document (siz

Re: From wchar_t to char32_t

2023-07-03 Thread Paul Eggert
On 2023-07-02 13:18, Bruno Haible wrote: If (size_t) -3 is possible, I suppose I should change diffutils to take this into account, as bleeding-edge diffutils/src/side.c treats (size_t) -3 as meaning the next input byte is an encoding error, which is obviously wrong. If you want the diffutils co

Re: From wchar_t to char32_t

2023-07-02 Thread Bruno Haible
Paul Eggert wrote: > On 2023-07-02 06:33, Bruno Haible wrote: > > +else if (bytes == (size_t) -3) > > + bytes = 0; > > Why is this sort of thing needed? I tried to explain it in https://lists.gnu.org/archive/html/bug-gnulib/2023-06/msg00134.html . Basical

Re: From wchar_t to char32_t

2023-07-02 Thread Paul Eggert
On 2023-07-02 06:33, Bruno Haible wrote: +else if (bytes == (size_t) -3) + bytes = 0; Why is this sort of thing needed? I thought that (size_t) -3 was possible only after a low surrogate, which is possible when decoding valid UTF-16 to Unicode, but no

Re: From wchar_t to char32_t

2023-07-02 Thread Bruno Haible
I wrote:
> In mbrtoc32:
>
>    Return value      Consumed bytes
>    ------------      --------------
>    small n > 0       n
>    0                 1
>    (size_t)(-3)      0
>
> The patch below thus fixes the uses of mbrtoc32.

More of the same kind: 2023-07-02 Brun
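
Editorial sketch of how a decoding loop might apply the return-value table above. The function name count_c32 is hypothetical. Treating an invalid or incomplete sequence as one single-byte unit is an assumption made for this illustration and is not necessarily what the gnulib patches do.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper: count the char32_t units decoded from s[0..n).  */
size_t
count_c32 (const char *s, size_t n)
{
  mbstate_t state;
  memset (&state, 0, sizeof state);
  size_t count = 0;

  while (n > 0)
    {
      char32_t c;
      size_t bytes = mbrtoc32 (&c, s, n, &state);

      if (bytes == (size_t) -1 || bytes == (size_t) -2)
        {
          /* Invalid or incomplete sequence: count one byte as one unit
             (an assumption of this sketch) and restart from the
             initial state.  */
          bytes = 1;
          memset (&state, 0, sizeof state);
        }
      else if (bytes == 0)
        bytes = 1;      /* a null character was decoded; 1 byte consumed */
      else if (bytes == (size_t) -3)
        bytes = 0;      /* character came from saved state; 0 bytes consumed */

      count++;
      s += bytes;
      n -= bytes;
    }
  return count;
}
```

In the (size_t) -3 case the loop does not advance, so the next call continues from the same input position with the updated state.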

Re: From wchar_t to char32_t

2023-07-02 Thread Bruno Haible
This patch migrates the 'mbmemcasecoll' module from wchar_t to char32_t. 2023-07-02 Bruno Haible mbmemcasecoll: Overcome wchar_t limitations. * lib/mbmemcasecoll.c: Include instead of . (apply_c32tolower): Renamed from apply_towlower. Use mbrtoc32

Re: From wchar_t to char32_t

2023-07-02 Thread Bruno Haible
This patch migrates the 'mbswidth' module from wchar_t to char32_t. 2023-07-02 Bruno Haible mbswidth: Overcome wchar_t limitations. * lib/mbswidth.c: Include instead of . (mbsnwidth): Use mbrtoc32 instead of mbrtowc. Use c32width instead of wc

Re: From wchar_t to char32_t

2023-07-02 Thread arnold
Hi. Bruno Haible wrote: > Arnold: I have added '#if GAWK' conditionals, knowing that gawk's build system > does not use gnulib-tool and you therefore pull from gnulib manually. This > means the improvements will not land in gawk, since dfa in gawk will continue > to use wchar_t. Much thanks. >

Re: From wchar_t to char32_t

2023-07-01 Thread Bruno Haible
Here is a proposed patch to overcome the wchar_t limitation in the 'dfa' module. Jim: The background is explained in The plan was exposed in

Re: From wchar_t to char32_t

2023-07-01 Thread Bruno Haible
This patch migrates the 'quotearg' module from wchar_t to char32_t. Note about the link dependency updates: - Adding $(LIBUNISTRING) is needed to avoid link errors on macOS, FreeBSD, NetBSD, Solaris when module 'libunistring-optional' is present and a libunistring is

Re: From wchar_t to char32_t

2023-06-30 Thread Bruno Haible
I did: > * lib/mbiter.h: Include instead of . > (mbiter_multi_next): Use mbrtoc32 instead of mbrtowc. > * lib/mbuiter.h: Include instead of . > (mbuiter_multi_next): Use mbrtoc32 instead of mbrtowc. > * lib/mbfile.h (mbfile_multi_getc): Use mbrtoc32 instead of mbrtow

Re: From wchar_t to char32_t

2023-06-25 Thread Bruno Haible
There was a small mistake in this patch, fixed like this: 2023-06-25 Bruno Haible exclude: Complete last change. * lib/exclude.c: Include instead of . diff --git a/lib/exclude.c b/lib/exclude.c index af204cd300..15f238e09c 100644 --- a/lib/exclude.c +++ b/lib/exclude.c @@ -2

Re: From wchar_t to char32_t

2023-06-24 Thread Bruno Haible
I wrote: > In Gnulib, the following areas will need migration: > > * lib/mbchar.h > lib/mbiter.h > lib/mbuiter.h > Draft patch attached. It seems to work fine, so I pushed this: 2023-06-24 Bruno Haible mbchar, mbiter, mbuiter: Overcome wchar_t limitations. * lib/mbchar

From wchar_t to char32_t

2023-06-19 Thread Bruno Haible
al/html_node/Strings-and-Characters.html The migration from wchar_t to char32_t can be done by writing 'char32_t' instead of 'wchar_t', and replacing function names according to this table: wchar_t char32_t --- 7.31.2 *wprintf
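
Editorial sketch of the renaming described above, applied to one pair of functions that exist in ISO C: mbrtowc (from <wchar.h>) and mbrtoc32 (from <uchar.h>). The helper names first_char_wchar and first_char_c32 are hypothetical. Returning -1 for invalid or incomplete input is a simplification made for this illustration.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Before: decode the first character of s[0..n) as a wchar_t.
   Returns its value, or -1 on invalid or incomplete input.  */
long
first_char_wchar (const char *s, size_t n)
{
  mbstate_t state;
  memset (&state, 0, sizeof state);
  wchar_t wc;
  size_t bytes = mbrtowc (&wc, s, n, &state);
  if (bytes == (size_t) -1 || bytes == (size_t) -2)
    return -1;
  return (long) wc;
}

/* After: the same code with 'char32_t' written instead of 'wchar_t'
   and mbrtoc32 instead of mbrtowc.  */
long
first_char_c32 (const char *s, size_t n)
{
  mbstate_t state;
  memset (&state, 0, sizeof state);
  char32_t c;
  size_t bytes = mbrtoc32 (&c, s, n, &state);
  if (bytes == (size_t) -1 || bytes == (size_t) -2)
    return -1;
  return (long) c;
}
```

Starting from the initial state, the first mbrtoc32 call cannot return (size_t) -3, so this one-shot helper does not need that case; a loop over a whole string does.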