On 2023-07-21 17:33, Bruno Haible wrote:
> It gets this info from mbrtoc32, which on most platforms gets this info
> from mbrtowc. This multibyte scanner knows when the bytes it has seen so
> far constitute
>   - a complete character, or
>   - an invalid character, or
>   - an incomplete character (i.e. if additional bytes may lead to a
>     complete character).
Ah, I had thought that the idea was to treat all the bytes of a byte sequence from 10646-1[1] R.2 Table 1 as a single invalid "character" (i.e., not a real character) if the byte sequence is not valid UTF-8. That's what Kuhn seems to be suggesting in [2].
But what you're saying is something different, which could be implemented by calling mbrtoc32.
For example, take the byte sequence F4 90 80 80. I had thought you were saying it would be treated as a single invalid unit [F4 90 80 80], because that sequence is in R.2 Table 1. As I now understand it, it would instead be treated as [F4 90] [80] [80], because [F4 90] is not an incomplete character (no additional bytes can turn it into a complete character).
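
To make the question concrete, here is a rough sketch of the segmentation I have in mind. It is not the gnulib code and it does not call mbrtoc32 or mbrtowc; classify_prefix and next_unit are hypothetical names, and the byte ranges are the usual well-formed UTF-8 ranges (at most four bytes, no surrogates, nothing above U+10FFFF). The assumption is that a unit ends at the first byte where the scanner's verdict stops being "incomplete". A real mbrtowc might draw the boundary elsewhere, for example before the offending byte.

```c
#include <assert.h>
#include <stddef.h>

enum scan_status { SCAN_COMPLETE, SCAN_INVALID, SCAN_INCOMPLETE };

/* Classify the first n bytes at s (n >= 1) against strict UTF-8.
   Hypothetical helper, written only to illustrate the three verdicts. */
enum scan_status classify_prefix(const unsigned char *s, size_t n)
{
    unsigned char lo = 0x80, hi = 0xBF; /* allowed range for the 2nd byte */
    unsigned char b0 = s[0];
    size_t need;                        /* length implied by the lead byte */

    if (b0 < 0x80) {
        need = 1;
    } else if (b0 >= 0xC2 && b0 <= 0xDF) {
        need = 2;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
        need = 3;
        if (b0 == 0xE0) lo = 0xA0;      /* reject overlong encodings */
        if (b0 == 0xED) hi = 0x9F;      /* reject surrogates */
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        need = 4;
        if (b0 == 0xF0) lo = 0x90;      /* reject overlong encodings */
        if (b0 == 0xF4) hi = 0x8F;      /* reject values above U+10FFFF */
    } else {
        return SCAN_INVALID;            /* stray trail byte, C0, C1, F5..FF */
    }

    for (size_t i = 1; i < n && i < need; i++) {
        unsigned char l = (i == 1) ? lo : 0x80;
        unsigned char h = (i == 1) ? hi : 0xBF;
        if (s[i] < l || s[i] > h)
            return SCAN_INVALID;        /* no further byte can repair this */
    }
    return n >= need ? SCAN_COMPLETE : SCAN_INCOMPLETE;
}

/* Return the length of the next unit in s[0..n-1] and store its verdict
   in *st. Bytes are consumed one at a time until the verdict is no longer
   "incomplete" or the input runs out. */
size_t next_unit(const unsigned char *s, size_t n, enum scan_status *st)
{
    assert(s != NULL && st != NULL && n > 0);
    size_t len = 1;
    while ((*st = classify_prefix(s, len)) == SCAN_INCOMPLETE && len < n)
        len++;
    return len;
}
```

Under these assumptions, calling next_unit repeatedly on F4 90 80 80 yields three invalid units of lengths 2, 1 and 1, i.e. [F4 90] [80] [80], whereas F4 8F BF BF comes back as one complete four-byte character.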
Is this right?

[1]: https://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html
[2]: https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt