On 07/06/2012 16:51, Eric Blake wrote:
> On 06/07/2012 08:13 AM, Paolo Bonzini wrote:
>> On 07/06/2012 14:50, Eric Blake wrote:
>>>>> The fix could be to have two different locale_charset() functions,
>>>>> one that returns "US-ASCII" and another one that returns "UTF-8".
>>>>> The first one to be used when MB_CUR_MAX and mbrtowc() are used as
>>>>> well, the second one to be used by gettext(). But the separation
>>>>> line between the two cases is not yet clear to me. Any insights?
>>
>> The separation line is what you wrote: whether you'll use the text
>> simply for presentation, or whether you'll process it before. But
>> alternatively, we might try a variant of what Eric has suggested...
>>
>>> On OS X, can we wrap MB_CUR_MAX to pretend to be 1 when in the "C"
>>> locale, to match what cygwin did in distinguishing between 'C' and
>>> 'C.UTF-8'?
>>
>> ... which is to wrap MB_CUR_MAX and pretend that it is 3.
>
> Actually, MB_CUR_MAX of UTF-8 is 6, thanks to surrogate pairs.

No, it is 6 mostly because of the original 31-bit code space of
ISO 10646. UTF-8 sequences that decode to the range 0xD800..0xDFFF are
invalid. Some programs produce this encoding, but glibc's iconv does
not support it, and technically it is not UTF-8.

However, I did count wrong: MB_CUR_MAX for UTF-8 must be at least 4 to
encode the 21 bits of Unicode (3 bits in a first byte of the form
11110bbb, plus 6 bits in each of the 3 following bytes: 3 + 6*3 = 21).

Paolo
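
For reference, the bit counting above can be sketched in C. This is only
an illustration of the arithmetic, not code from gnulib or glibc; the
function names utf8_payload_bits() and utf8_length() are hypothetical,
and the sketch assumes the original UTF-8 scheme with sequences of up to
6 bytes.

```c
/* Payload bits carried by a UTF-8 sequence of n bytes, 1 <= n <= 6.
   (Hypothetical helper; assumes the original scheme of up to 6 bytes.) */
static int utf8_payload_bits(int n)
{
    if (n == 1)
        return 7;                   /* 0bbbbbbb */
    /* The lead byte spends n one-bits and a zero bit on the length
       marker, leaving 7 - n payload bits; each of the n - 1
       continuation bytes (10bbbbbb) carries 6 payload bits. */
    return (7 - n) + 6 * (n - 1);
}

/* Smallest sequence length whose payload can hold the value cp,
   or 0 if cp needs more than 31 bits.  This only counts bits; it does
   not reject the invalid range 0xD800..0xDFFF. */
static int utf8_length(unsigned long cp)
{
    int n;

    for (n = 1; n <= 6; n++)
        if (cp < (1UL << utf8_payload_bits(n)))
            return n;
    return 0;
}
```

With these definitions, utf8_payload_bits(4) gives 3 + 6*3 = 21, which is
enough for the highest Unicode code point, 0x10FFFF, while
utf8_payload_bits(6) gives 31, which is where the old limit of 6 bytes
comes from.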