On 2023-07-24 17:34, Bruno Haible wrote:
Paul Eggert wrote:
It gets this info from mbrtoc32, which on most platforms gets this info
from mbrtowc. This multibyte scanner knows when the bytes it has seen
so far constitute
- a complete character, or
- an invalid character, or
- an incomplete character (i.e. if additional bytes may lead to a
complete character).
Ah, I had thought that the idea was to treat all the bytes of a byte
sequence from ISO 10646-1 [1] R.2 Table 1 as a single invalid "character"
(i.e., not a real character) if the byte sequence is not valid UTF-8.
An arbitrary sequence of invalid bytes (which therefore could be
arbitrarily long) is not meant here. This would not produce good results
for the user, and would not be implementable in O(1) space.
Arbitrarily long sequences would indeed be a problem, but R.2 Table 1
doesn't do that. The length limit is 6 bytes. (Not that this matters
much to us, since glibc isn't taking this approach.)
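
To make the 6-byte bound concrete, here is a rough sketch of that grouping, not something glibc or diff does. The names lead_length and unit_length are made up for illustration, and grouping a truncated sequence with whatever continuation bytes are present is my assumption.

```c
#include <stddef.h>

/* Sequence length implied by a lead byte under the original 1- to 6-byte
   table; 0 if the byte cannot start a sequence.  (Hypothetical helper.)  */
size_t lead_length(unsigned char b)
{
    if (b < 0x80) return 1;   /* 0xxxxxxx */
    if (b < 0xC0) return 0;   /* 10xxxxxx: continuation byte */
    if (b < 0xE0) return 2;   /* 110xxxxx */
    if (b < 0xF0) return 3;   /* 1110xxxx */
    if (b < 0xF8) return 4;   /* 11110xxx */
    if (b < 0xFC) return 5;   /* 111110xx */
    if (b < 0xFE) return 6;   /* 1111110x */
    return 0;                 /* FE and FF are never used */
}

/* Number of bytes to group as one unit at the start of s[0..n-1], n >= 1.
   A stray continuation byte, FE or FF stands alone; otherwise the lead
   byte is grouped with the continuation bytes that follow it, up to the
   length the lead byte announces.  The result is never more than 6.  */
size_t unit_length(const unsigned char *s, size_t n)
{
    size_t want = lead_length(s[0]);
    if (want == 0)
        return 1;
    size_t len = 1;
    while (len < want && len < n && (s[len] & 0xC0) == 0x80)
        len++;
    return len;
}
```

Under these assumptions F4 90 80 80 comes out as one 4-byte unit, and the scanner needs only the lead byte's announced length as state.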
xterm is probably the only terminal emulator that renders the entire section 3
of [2] as Markus Kuhn proposed.
And even xterm doesn't follow Kuhn's section 5.
I've pretty much given up on having 'diff' accurately count columns of
display for encoding errors. It's just not practical.
Per [1] p. 127 paragraph 3, decoders can decompose the sequence F4 90 80 80 to
[F4 90] [80] [80]
or to
[F4] [90] [80] [80]
Since mbrtowc() returns (size_t)(-1) for this sequence, without telling
how long the invalid sequence was, decoders/scanners that are based on
mbrtowc() (or mbrtoc32()) will decompose it like this:
[F4] [90] [80] [80]
Actually, these decoders/scanners can decompose it either way. The way
you suggest is easier and I expect everybody does it that way. But a
decoder/scanner could do it the other way, by calling mbrtoc32 with n=1,
then with n=2, and so forth, and seeing when the return value stops
being (size_t) -2 and starts being (size_t) -1. A similar approach would
work for decoders/scanners that use mbcel, mbiter, etc.
Not that I'm suggesting this. diff can just do things the easy way that
I expect everybody else uses.
[1]: https://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html
[2]: https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt