Re: mbcel module for Gnulib?, incomplete multibyte sequences

Paul Eggert Fri, 21 Jul 2023 18:14:25 -0700

On 2023-07-21 17:33, Bruno Haible wrote:

> It gets this info from mbrtoc32, which on most platforms gets this info
> from mbrtowc. This multibyte scanner knows when the bytes it has seen
> so far constitute
>    - a complete character, or
>    - an invalid character, or
>    - an incomplete character (i.e. if additional bytes may lead to a
>      complete character).

Ah, I had thought that the idea was to treat all the bytes of a byte sequence from 10646-1[1] R.2 Table 1 as a single invalid "character" (i.e., not a real character) if the byte sequence is not valid UTF-8. That's what Kuhn seems to be suggesting in [2].

But what you're saying is something different, which could be implemented by calling mbrtoc32.

For example, as I understand it, the byte sequence F4 90 80 80, which I had thought you were saying would be treated as a single byte sequence [F4 90 80 80] because that's in R.2 Table 1, would instead be treated as [F4 90] [80] [80], because [F4 90] is not an incomplete character (additional bytes cannot lead to a complete character).


Is this right?
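
To make the question concrete, here is a minimal sketch of the segmentation as I describe it above. It is only an illustration: it does not call mbrtoc32, it hard-codes the UTF-8 rules of RFC 3629 rather than consulting the locale, and the names mb_state, classify_utf8_prefix and first_unit_len are hypothetical, not taken from Gnulib or the proposed mbcel module. Whether mbrtoc32 on a given platform splits the bytes the same way is exactly what is being asked.

```c
#include <assert.h>
#include <stddef.h>

/* The three outcomes the scanner distinguishes (hypothetical names).  */
enum mb_state { MB_COMPLETE, MB_INVALID, MB_INCOMPLETE };

/* Classify the N bytes at P as a prefix of one UTF-8 character.
   N must be positive and at most the length implied by the lead byte.
   Rules per RFC 3629: no overlong forms, no surrogates, nothing
   above U+10FFFF.  */
static enum mb_state
classify_utf8_prefix (const unsigned char *p, size_t n)
{
  unsigned char b0 = p[0];
  size_t len;                          /* length implied by lead byte */
  unsigned char lo = 0x80, hi = 0xBF;  /* allowed range of 2nd byte */

  assert (n > 0);
  if (b0 < 0x80)
    len = 1;
  else if (0xC2 <= b0 && b0 <= 0xDF)
    len = 2;
  else if (0xE0 <= b0 && b0 <= 0xEF)
    {
      len = 3;
      if (b0 == 0xE0) lo = 0xA0;       /* reject overlong forms */
      if (b0 == 0xED) hi = 0x9F;       /* reject surrogates */
    }
  else if (0xF0 <= b0 && b0 <= 0xF4)
    {
      len = 4;
      if (b0 == 0xF0) lo = 0x90;       /* reject overlong forms */
      if (b0 == 0xF4) hi = 0x8F;       /* reject > U+10FFFF */
    }
  else
    return MB_INVALID;                 /* 80..C1, F5..FF: not a lead byte */

  for (size_t i = 1; i < n && i < len; i++)
    {
      unsigned char l = i == 1 ? lo : 0x80;
      unsigned char h = i == 1 ? hi : 0xBF;
      if (p[i] < l || h < p[i])
        return MB_INVALID;             /* no further bytes can fix this */
    }
  return n < len ? MB_INCOMPLETE : MB_COMPLETE;
}

/* Return the length of the first unit among the N > 0 bytes at P,
   storing its classification in *STATE.  The prefix grows one byte
   at a time until it stops being incomplete or the input ends.  */
static size_t
first_unit_len (const unsigned char *p, size_t n, enum mb_state *state)
{
  size_t k = 1;
  assert (n > 0);
  while ((*state = classify_utf8_prefix (p, k)) == MB_INCOMPLETE && k < n)
    k++;
  return k;
}
```

Applied repeatedly to F4 90 80 80, first_unit_len yields an invalid unit of length 2, then two invalid units of length 1, i.e. [F4 90] [80] [80]. F4 alone is classified as incomplete, since F4 8F BF BF is a complete character.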

[1]: https://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html
[2]: https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
