Re: mbcel module for Gnulib?, incomplete multibyte sequences

Paul Eggert Tue, 25 Jul 2023 20:55:13 -0700

On 2023-07-24 17:34, Bruno Haible wrote:

Paul Eggert wrote:

It gets this info from mbrtoc32, which on most platforms gets this info
from mbrtowc. This multibyte scanner knows when the bytes it has seen
so far constitute
    - a complete character, or
    - an invalid character, or
    - an incomplete character (i.e. if additional bytes may lead to a
      complete character).


Ah, I had thought that the idea was to treat all the bytes of a byte
sequence from 10646-1[1] R.2 Table 1 as a single invalid "character"
(i.e., not a real character) if the byte sequence is not valid UTF-8.


An arbitrary sequence of invalid bytes (which therefore could be
arbitrarily long) is not meant here. This would not produce good results
for the user, and would not be implementable in O(1) space.

Arbitrarily long sequences would indeed be a problem, but R.2 Table 1
doesn't do that. The length limit is 6 bytes. (Not that this matters
much to us, since glibc isn't taking this approach.)

xterm is probably the only terminal emulator that renders the entire section 3
of [2] as Markus Kuhn proposed.


And even xterm doesn't follow Kuhn's section 5.

I've pretty much given up on having 'diff' accurately count columns of
display for encoding errors. It's just not practical.

Per [1] p. 127 paragraph 3, decoders can decompose it to
   [F4 90] [80] [80]
or to
   [F4] [90] [80] [80]

Since mbrtowc() returns (size_t)(-1) for this sequence, without telling
how long the invalid sequence was, decoders/scanners that are based on
mbrtowc() (or mbrtoc32()) will decompose it like this:
   [F4] [90] [80] [80]

Actually, these decoders/scanners can decompose it either way. The way
you suggest is easier and I expect everybody does it that way. But a
decoder/scanner could do it the other way, by calling mbrtoc32 with n=1,
then with n=2, and so forth, and seeing when the return value stops
being (size_t) -2 and starts being (size_t) -1. A similar approach would
work for decoders/scanners that use mbcel or mbiter or etc.

Not that I'm suggesting this. diff can just do things the easy way that
I expect everybody else uses.
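
For the record, here is a minimal sketch of the probing approach
(calling mbrtoc32 with n=1, then n=2, and so forth). The function name
encoding_error_length is hypothetical; it is not part of Gnulib, mbcel
or mbiter. Each probe starts from a fresh initial mbstate_t, and the
result depends on the current locale and on the mbrtoc32 implementation,
so no particular split of F4 90 80 80 is assumed here.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper (not a Gnulib API): probe S, which has LEN bytes,
   with mbrtoc32 using n = 1, 2, ..., LEN.  Return the smallest n for
   which mbrtoc32 stops returning (size_t) -2 (incomplete) and returns
   (size_t) -1 (encoding error).  Return 0 if S starts with a valid
   character, or if all LEN bytes are still an incomplete character.  */
static size_t
encoding_error_length (char const *s, size_t len)
{
  for (size_t n = 1; n <= len; n++)
    {
      /* Start every probe in the initial conversion state, so that a
         previous incomplete probe does not affect this one.  */
      mbstate_t state;
      memset (&state, 0, sizeof state);

      char32_t c;
      size_t r = mbrtoc32 (&c, s, n, &state);

      if (r == (size_t) -1)
        return n;       /* The first N bytes form an encoding error.  */
      if (r != (size_t) -2)
        return 0;       /* A character was decoded; no error here.  */
      /* Still incomplete: try again with one more byte.  */
    }
  return 0;
}
```

A scanner could treat the first encoding_error_length (s, len) bytes as
one invalid unit when the result is nonzero, instead of stepping past a
single byte. This costs up to LEN calls per error, which is one more
reason to prefer the easy way.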

> [1]: https://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html
> [2]: https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
