Paul Eggert wrote:
> >> in UTF-8 the byte sequence E0 80 is not an incomplete character
> >> (in the sense that additional bytes may lead to a complete character),
> >> because every byte you append to E0 80 causes glibc mbrtoc32 to return
> >> (size_t) -1. Yet glibc mbrtoc32 returns (size_t) -2 for E0 80.
> >
> > And gnulib/lib/unistr/u8-mbtouc-aux.c does it wrong as well!
> > The return value for E0 {80..9F} should be (size_t) -1, because
> > U+0800 is E0 A0 80.
> >
> > I'll fix the gnulib part soon. Very good point. It looks like few people
> > understood the implications of
> > https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf page 125, table 3-7.
>
> I hope we don't need to replace mbrtoc32 merely because of this obscure
> issue.

We don't need to, because
  - it's not a blatant POSIX failure,
  - this bug has been in glibc for 20 years and in GNU libunistring for
    more than 10 years, and no one noticed,
  - the difference is only whether mbrtowc() returns (size_t)-1 vs.
    (size_t)-2.

Bruno
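
To make the distinction concrete, here is a minimal sketch of a prefix classifier that follows the second-byte ranges of table 3-7. It is not the glibc or gnulib code. The names u8_status and u8_classify are hypothetical, chosen only for this illustration. "Invalid" corresponds to the (size_t) -1 case and "incomplete" to the (size_t) -2 case discussed above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes, for illustration only. */
enum u8_status { U8_COMPLETE, U8_INCOMPLETE, U8_INVALID };

/* Classify the n bytes at s (n >= 1) as the start of one UTF-8 character,
   using the well-formed byte ranges of Unicode 15.0, table 3-7.  */
static enum u8_status
u8_classify (const unsigned char *s, size_t n)
{
  assert (n >= 1);

  unsigned char c = s[0];
  size_t len;                          /* length implied by the lead byte */
  unsigned char lo = 0x80, hi = 0xBF;  /* allowed range of the second byte */

  if (c < 0x80)
    return U8_COMPLETE;
  else if (c >= 0xC2 && c <= 0xDF)
    len = 2;
  else if (c >= 0xE0 && c <= 0xEF)
    {
      len = 3;
      if (c == 0xE0) lo = 0xA0;        /* E0 80..9F would be overlong */
      if (c == 0xED) hi = 0x9F;        /* ED A0..BF would be surrogates */
    }
  else if (c >= 0xF0 && c <= 0xF4)
    {
      len = 4;
      if (c == 0xF0) lo = 0x90;        /* F0 80..8F would be overlong */
      if (c == 0xF4) hi = 0x8F;        /* F4 90..BF would exceed U+10FFFF */
    }
  else
    return U8_INVALID;                 /* 80..C1 and F5..FF never start a character */

  /* Check the trailing bytes that are present; only the second byte
     has a range that depends on the lead byte.  */
  for (size_t i = 1; i < n && i < len; i++)
    {
      unsigned char l = (i == 1 ? lo : 0x80);
      unsigned char h = (i == 1 ? hi : 0xBF);
      if (s[i] < l || s[i] > h)
        return U8_INVALID;
    }
  return n >= len ? U8_COMPLETE : U8_INCOMPLETE;
}
```

With this sketch, the two bytes E0 80 classify as invalid, while E0 A0 classifies as incomplete and E0 A0 80 (U+0800) as complete.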