Re: mbcel module for Gnulib?, incomplete multibyte sequences

Bruno Haible Fri, 21 Jul 2023 17:33:25 -0700

[Quick answer on this part:]

Paul Eggert wrote:
> What does mbiterf do in non-UTF-8 multi-byte locales? How can it tell 
> how long the invalid sequence is?


It gets this information from mbrtoc32, which on most platforms in turn
relies on mbrtowc. This multibyte scanner knows when the bytes it has seen
so far constitute
  - a complete character, or
  - an invalid character, or
  - an incomplete character (i.e. if additional bytes may lead to a
    complete character).
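
To make the three outcomes concrete, here is a minimal sketch of how a
caller can tell them apart using the standard mbrtowc return values:
(size_t) -1 for an invalid sequence, (size_t) -2 for an incomplete one,
and anything else for a complete character. The names mb_status and
classify_prefix are hypothetical, chosen for illustration; this is not
Gnulib's mbiterf code.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

enum mb_status { MB_COMPLETE, MB_INVALID, MB_INCOMPLETE };

/* Classify the first N bytes at S in the current locale's encoding.
   Hypothetical helper, for illustration only.  */
static enum mb_status
classify_prefix (const char *s, size_t n)
{
  mbstate_t state;
  wchar_t wc;
  size_t ret;

  assert (s != NULL);
  memset (&state, 0, sizeof state);   /* initial conversion state */
  ret = mbrtowc (&wc, s, n, &state);
  if (ret == (size_t) -1)
    return MB_INVALID;                /* bytes cannot start a character */
  if (ret == (size_t) -2)
    return MB_INCOMPLETE;             /* more bytes may complete it */
  return MB_COMPLETE;                 /* 0 (null character) or a length */
}
```

The result depends on the locale selected with setlocale, since mbrtowc
decodes according to the current LC_CTYPE category.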

For example, for EUC-JP, it has these scanning rules:
  1st byte in [0x00..0x7F] => complete character
  1st byte in [0x8E..0x8F] ∪ [0xA1..0xA8] ∪ [0xB0..0xFE]
       => incomplete character with 1 byte so far
    1st byte != 0x8F and 2nd byte in [0xA1..0xFE]
        => either complete or invalid character
    1st byte == 0x8F and 2nd byte in
    {0xA2} ∪ [0xA6..0xA7] ∪ [0xA9..0xAB] ∪ [0xB0..0xFE]
        => incomplete character with 2 bytes so far
    1st byte == 0x8F and 2nd byte has another value
        => invalid character
  1st byte has another value
      => invalid character

Similarly for each other encoding.
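
The EUC-JP rules above can be written down as a small prefix classifier.
This is only a sketch: the names scan_result, in_range and
eucjp_scan_prefix are hypothetical, it looks at no more than the first
two bytes (the message gives no rule for the third byte of a 0x8F
sequence), and it assumes that a second byte outside [0xA1..0xFE] after
a lead byte other than 0x8F is invalid, which the rules do not state
explicitly.

```c
#include <assert.h>
#include <stddef.h>

enum scan_result
{
  SCAN_COMPLETE,             /* complete character */
  SCAN_INCOMPLETE,           /* more bytes may lead to a complete character */
  SCAN_COMPLETE_OR_INVALID,  /* right shape; a table lookup would decide */
  SCAN_INVALID               /* invalid character */
};

static int
in_range (unsigned char b, unsigned char lo, unsigned char hi)
{
  return lo <= b && b <= hi;
}

/* Classify a prefix of N bytes (N is 1 or 2) at P per the rules above.  */
static enum scan_result
eucjp_scan_prefix (const unsigned char *p, size_t n)
{
  unsigned char b1, b2;

  assert (p != NULL && (n == 1 || n == 2));
  b1 = p[0];
  if (b1 <= 0x7F)
    return SCAN_COMPLETE;
  if (!(in_range (b1, 0x8E, 0x8F) || in_range (b1, 0xA1, 0xA8)
        || in_range (b1, 0xB0, 0xFE)))
    return SCAN_INVALID;
  if (n == 1)
    return SCAN_INCOMPLETE;           /* 1 byte so far */

  b2 = p[1];
  if (b1 != 0x8F)
    /* Assumption: any other 2nd byte is invalid.  */
    return (in_range (b2, 0xA1, 0xFE)
            ? SCAN_COMPLETE_OR_INVALID : SCAN_INVALID);
  if (b2 == 0xA2 || in_range (b2, 0xA6, 0xA7)
      || in_range (b2, 0xA9, 0xAB) || in_range (b2, 0xB0, 0xFE))
    return SCAN_INCOMPLETE;           /* 2 bytes so far */
  return SCAN_INVALID;
}
```

For instance, the single byte 0xA4 is reported as incomplete, the pair
0xA4 0xA2 as "complete or invalid", and the pair 0x8F 0xA1 as invalid.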

Bruno
