On 7/2/23 13:18, Bruno Haible wrote:
> Paul Eggert wrote:
>> When can we get (size_t) -3 in a real-world system?
>
> It can/could occur if all of the following conditions are met:
>   * The locale encoding is BIG5-HKSCS, e.g. on a glibc system the
>     zh_HK.BIG5-HKSCS locale.
>   * The input is one of the 4 characters in that encoding that map to
>     a sequence of two Unicode characters:
>
>       input        maps to
>       ---------    -------------
>       0x88 0x62    U+00CA U+0304
>       0x88 0x64    U+00CA U+030C
>       0x88 0xA3    U+00EA U+0304
>       0x88 0xA5    U+00EA U+030C
> ...
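
The table above can be reproduced with a minimal sketch like the following. It assumes that Python's `big5hkscs` codec behaves like the glibc converter for these four inputs; it is only a stand-in and does not exercise mbrtoc32 itself. The names `PAIRS`, `code_points` and `show` are made up for this illustration.

```python
# The four two-byte inputs listed in the table above.
PAIRS = [b"\x88\x62", b"\x88\x64", b"\x88\xa3", b"\x88\xa5"]


def code_points(data, encoding):
    """Decode `data` and return its code points as 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in data.decode(encoding)]


def show(encoding):
    """Print each byte pair next to the code points it decodes to."""
    for pair in PAIRS:
        hex_bytes = " ".join(f"0x{b:02X}" for b in pair)
        print(hex_bytes, "->", " ".join(code_points(pair, encoding)))


if __name__ == "__main__":
    # Assumption: the Python codec stands in for the locale's converter.
    show("big5hkscs")
```

Each input decodes to two code points here, which is the situation in which a one-character-at-a-time interface needs a second call that consumes no further input bytes.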
I looked into this some more and unfortunately don't understand the
above. Could you explain a bit more?
<http://www.nits.org.cn/index/article/4034> says that the official
mapping table for GB 18030-2022 and BMP is here:
http://www.nits.org.cn/cmsfile/download/134
and this contains the following (nonconsecutive) lines:
5746 8862
5749 8864
57BC 88A3
57BE 88A5
which, if I understand things correctly, means the four two-byte
sequences that you mention should convert to the following four Unicode
characters:
坆 U+5746 CJK IDEOGRAPH-5746
坉 U+5749 CJK IDEOGRAPH-5749
垼 U+57BC CJK IDEOGRAPH-57BC
垾 U+57BE CJK IDEOGRAPH-57BE
without mbrtoc32 having to return (size_t) -3.
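
As a rough cross-check of that reading of the table, here is a minimal sketch that decodes the same four byte pairs as GB 18030. It assumes that Python's `gb18030` codec agrees with the official table for these two-byte sequences; the codec predates the 2022 edition, so this is an assumption and not a check against the standard. The names `GB_PAIRS` and `gb_code_points` are made up for this illustration.

```python
# The four byte pairs from the mapping-table lines quoted above.
GB_PAIRS = [b"\x88\x62", b"\x88\x64", b"\x88\xa3", b"\x88\xa5"]


def gb_code_points(data):
    """Decode `data` as GB 18030 and return 'U+XXXX' strings."""
    # Assumption: Python's codec matches the table for these inputs.
    return [f"U+{ord(ch):04X}" for ch in data.decode("gb18030")]


if __name__ == "__main__":
    for pair in GB_PAIRS:
        hex_bytes = " ".join(f"0x{b:02X}" for b in pair)
        print(hex_bytes, "->", " ".join(gb_code_points(pair)))
```

Interpreted as GB 18030, each pair yields exactly one code point, so no second output character is pending after the first call.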
Perhaps there was a problem with an earlier version of GB 18030 that has
been fixed in the 2022 edition?