I did:
> 	* lib/mbiter.h: Include <uchar.h> instead of <wchar.h>.
> 	(mbiter_multi_next): Use mbrtoc32 instead of mbrtowc.
> 	* lib/mbuiter.h: Include <uchar.h> instead of <wchar.h>.
> 	(mbuiter_multi_next): Use mbrtoc32 instead of mbrtowc.
> 	* lib/mbfile.h (mbfile_multi_getc): Use mbrtoc32 instead of mbrtowc.

There's a small difference between mbrtowc and mbrtoc32: While the return
values (size_t)(-1) and (size_t)(-2) have the same meaning, mbrtoc32 (in
theory) has a possible return value (size_t)(-3). This adds one case to the
rule for computing the number of consumed bytes.

In mbrtowc:

  Return value     Consumed bytes
  ------------     --------------
  small n > 0      n
  0                1

In mbrtoc32:

  Return value     Consumed bytes
  ------------     --------------
  small n > 0      n
  0                1
  (size_t)(-3)     0

The patch below thus fixes the uses of mbrtoc32.

I said "in theory". This situation occurs if and only if there is a
character in the locale's encoding that corresponds to a sequence of two
or more Unicode characters. To find which encodings have this property, do

  $ ls -1 glibc/iconvdata/*.precomposed
  glibc/iconvdata/BIG5HKSCS.precomposed
  glibc/iconvdata/EUC-JISX0213.precomposed
  glibc/iconvdata/SHIFT_JISX0213.precomposed
  glibc/iconvdata/TCVN5712-1.precomposed
  glibc/iconvdata/TSCII.precomposed

The encodings EUC-JISX0213, SHIFT_JISX0213, TSCII are not used as the locale
encoding of any locale on any system (see localcharset.h).

TCVN5712-1 was used as a locale encoding until 2012-05-21 (see
glibc/localedata/SUPPORTED).

The only system that still has a locale with BIG5-HKSCS encoding is glibc,
AFAIK. But since in glibc, mbrtoc32 is identical to mbrtowc (except for the
private internal state), mbrtoc32 cannot return (size_t)(-3) there either.
(Although glibc may get fixed to handle the zh_HK.BIG5-HKSCS locale better?
Or maybe this locale will be dropped, like the TCVN5712-1 locale before?)

So, for the moment, no mbrtoc32() implementation returns (size_t)(-3).
But IMO, in order to be future-proof, we should include the code to handle
this case; especially since it's only 2 lines of code.


2023-06-30  Bruno Haible  <br...@clisp.org>

	Accommodate a difference between mbrtowc and mbrtoc32.
	* lib/mbiter.h (mbiter_multi_next): Handle the mbrtoc32 return value
	(size_t)(-3).
	* lib/mbuiter.h (mbuiter_multi_next): Likewise.
	* lib/mbfile.h (mbfile_multi_getc): Likewise.

diff --git a/lib/mbfile.h b/lib/mbfile.h
index 7c6d70fcae..716ab3fc89 100644
--- a/lib/mbfile.h
+++ b/lib/mbfile.h
@@ -183,6 +183,10 @@ mbfile_multi_getc (struct mbchar *mbc, struct mbfile_multi *mbf)
                   assert (mbf->buf[0] == '\0');
                   assert (mbc->wc == 0);
                 }
+              else if (bytes == (size_t) -3)
+                /* The previous multibyte sequence produced an additional 32-bit
+                   wide character.  */
+                bytes = 0;
               mbc->wc_valid = true;
               break;
             }
diff --git a/lib/mbiter.h b/lib/mbiter.h
index 93bad990a1..fadefe104b 100644
--- a/lib/mbiter.h
+++ b/lib/mbiter.h
@@ -163,6 +163,10 @@ mbiter_multi_next (struct mbiter_multi *iter)
                   assert (*iter->cur.ptr == '\0');
                   assert (iter->cur.wc == 0);
                 }
+              else if (iter->cur.bytes == (size_t) -3)
+                /* The previous multibyte sequence produced an additional 32-bit
+                   wide character.  */
+                iter->cur.bytes = 0;
               iter->cur.wc_valid = true;
 
               /* When in the initial state, we can go back treating ASCII
diff --git a/lib/mbuiter.h b/lib/mbuiter.h
index 02e3190f1c..954e11f635 100644
--- a/lib/mbuiter.h
+++ b/lib/mbuiter.h
@@ -172,6 +172,10 @@ mbuiter_multi_next (struct mbuiter_multi *iter)
                   assert (*iter->cur.ptr == '\0');
                   assert (iter->cur.wc == 0);
                 }
+              else if (iter->cur.bytes == (size_t) -3)
+                /* The previous multibyte sequence produced an additional 32-bit
+                   wide character.  */
+                iter->cur.bytes = 0;
               iter->cur.wc_valid = true;
 
               /* When in the initial state, we can go back treating ASCII
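
For illustration only, here is a stand-alone sketch of the consumed-bytes
rule from the mbrtoc32 table, outside of the gnulib iterators. The helper
names consumed_bytes and count_c32 are hypothetical and not part of gnulib.
The sketch also assumes that, while an additional char32_t is pending, the
mbstate_t is not in the initial state, so that mbsinit can detect it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper (not a gnulib function): map a successful
   mbrtoc32 return value to the number of input bytes consumed.
   RET must not be (size_t)(-1) or (size_t)(-2).  */
static size_t
consumed_bytes (size_t ret)
{
  assert (ret != (size_t) -1 && ret != (size_t) -2);
  if (ret == 0)
    return 1;   /* null character: one byte consumed */
  if (ret == (size_t) -3)
    return 0;   /* additional char32_t from the previous sequence */
  return ret;   /* small n > 0: n bytes consumed */
}

/* Hypothetical helper: count the char32_t values obtained by decoding
   the N bytes at S in the current locale.  Returns (size_t)(-1) if the
   input contains an invalid or incomplete multibyte sequence.
   Assumption: while additional char32_t values are pending, STATE is
   not in the initial state, so mbsinit can be used to detect them.  */
static size_t
count_c32 (const char *s, size_t n)
{
  mbstate_t state;
  size_t pos = 0;
  size_t count = 0;

  memset (&state, 0, sizeof state);
  while (pos < n || !mbsinit (&state))
    {
      char32_t c;
      size_t ret = mbrtoc32 (&c, s + pos, n - pos, &state);

      if (ret == (size_t) -1 || ret == (size_t) -2)
        return (size_t) -1;
      pos += consumed_bytes (ret);
      count++;
    }
  return count;
}
```

With ASCII input in the "C" locale, count_c32 ("abc", 3) yields 3; the
(size_t)(-3) branch is only reached on an implementation that actually
produces more than one char32_t for a single multibyte character.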