[Quick answer on this part:]

Paul Eggert wrote:
> What does mbiterf do in non-UTF-8 multi-byte locales? How can it tell
> how long the invalid sequence is?

It gets this info from mbrtoc32, which on most platforms gets this info
from mbrtowc. This multibyte scanner knows when the bytes it has seen so
far constitute
  - a complete character, or
  - an invalid character, or
  - an incomplete character (i.e. if additional bytes may lead to a
    complete character).

For example, for EUC-JP, it has these scanning rules:

  1st byte in [0x00..0x7F]
    => complete character
  1st byte in [0x8E..0x8F] ∪ [0xA1..0xA8] ∪ [0xB0..0xFE]
    => incomplete character with 1 byte so far
  1st byte != 0x8F and 2nd byte in [0xA1..0xFE]
    => either complete or invalid character
  1st byte == 0x8F and 2nd byte in {0xA2} ∪ [0xA6..0xA7] ∪ [0xA9..0xAB] ∪ [0xB0..0xFE]
    => incomplete character with 2 bytes so far
  1st byte == 0x8F and 2nd byte has another value
    => invalid character
  1st byte has another value
    => invalid character

Similarly for each other encoding.

Bruno
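
The EUC-JP scanning rules quoted above can be sketched as a small prefix
classifier. This is only an illustration of the rules as stated in this
message, not the code that mbrtoc32 or mbrtowc actually run. The names
scan_state, in_range and eucjp_scan are hypothetical. Two points are
assumptions: a valid lead byte other than 0x8F followed by a second byte
outside [0xA1..0xFE] is treated as invalid (the rules above do not list
that case), and only the first one or two bytes are examined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Verdict for the bytes seen so far (hypothetical names). */
enum scan_state {
  SCAN_COMPLETE,            /* a complete character */
  SCAN_INCOMPLETE,          /* more bytes may lead to a complete character */
  SCAN_INVALID,             /* an invalid character */
  SCAN_COMPLETE_OR_INVALID  /* length is known; validity needs a table lookup */
};

static bool in_range(unsigned char b, unsigned char lo, unsigned char hi)
{
  return lo <= b && b <= hi;
}

/* Classify the first n bytes (n is 1 or 2) of an EUC-JP sequence,
   following the rules quoted in this message. */
enum scan_state eucjp_scan(const unsigned char *buf, size_t n)
{
  assert(buf != NULL && (n == 1 || n == 2));

  unsigned char b1 = buf[0];

  /* 1st byte in [0x00..0x7F] => complete character. */
  if (b1 <= 0x7F)
    return SCAN_COMPLETE;

  /* 1st byte outside the lead-byte ranges => invalid character. */
  if (!(in_range(b1, 0x8E, 0x8F)
        || in_range(b1, 0xA1, 0xA8)
        || in_range(b1, 0xB0, 0xFE)))
    return SCAN_INVALID;

  /* Valid lead byte, nothing else seen yet => incomplete, 1 byte so far. */
  if (n == 1)
    return SCAN_INCOMPLETE;

  unsigned char b2 = buf[1];

  if (b1 != 0x8F)
    {
      /* 2nd byte in [0xA1..0xFE] => either complete or invalid.
         Assumption: any other 2nd byte => invalid. */
      return in_range(b2, 0xA1, 0xFE) ? SCAN_COMPLETE_OR_INVALID
                                      : SCAN_INVALID;
    }

  /* 1st byte == 0x8F: some 2nd bytes announce a 3-byte character. */
  if (b2 == 0xA2
      || in_range(b2, 0xA6, 0xA7)
      || in_range(b2, 0xA9, 0xAB)
      || in_range(b2, 0xB0, 0xFE))
    return SCAN_INCOMPLETE;

  return SCAN_INVALID;
}
```

With this sketch, the single byte 0xA4 is reported as incomplete, the pair
0xA4 0xA2 as "complete or invalid", and the pair 0x8F 0xA1 as invalid, so
a caller can tell that the invalid sequence is 2 bytes long.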