Stephen Butler wrote:
> POSIX says that the "C" locale should treat text data is binary input,

Can you please point to where this is written? IMO [1] describes the behaviour of the "C" locale only for characters that belong to what we know as "US-ASCII" (i.e. bytes 0x00..0x7F). As soon as you pass the string "Rémi" to a program running in the "C" locale, you are speculating on implementation-dependent behaviour.

> But this is dangerous, because now UTF-8 is set but MB_CUR_MAX is 1
> and various parts of sed interpret "Rémi Leblond" as an invalid
> character sequence for a UTF-8 character set.

Indeed, I can see how this inconsistency leads to bugs like the ones described.

The fix could be to have two different locale_charset() functions, one that returns "US-ASCII" and another one that returns "UTF-8". The first would be used wherever MB_CUR_MAX and mbrtowc() are used as well; the second would be used by gettext().

But the dividing line between the two cases is not yet clear to me. Any insights?

Bruno

[1] http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html#tag_07_02
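
To make the inconsistency discussed above concrete, here is a minimal C sketch. It is an illustration only, not gnulib or sed code: the helper names charset_matches_mb_cur_max() and first_char_length() are hypothetical, and the check assumes that a charset name of "UTF-8" only makes sense when MB_CUR_MAX is greater than 1.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if the charset name is consistent with
   the current locale's MB_CUR_MAX, 0 otherwise.  A name of "UTF-8"
   combined with MB_CUR_MAX == 1 is the mismatch described in the thread. */
static int charset_matches_mb_cur_max(const char *charset)
{
    assert(charset != NULL);
    if (strcmp(charset, "UTF-8") == 0)
        return MB_CUR_MAX > 1;
    return 1; /* assumption: other names are not checked in this sketch */
}

/* Hypothetical helper: decodes the first character of s with mbrtowc()
   and returns its result: the number of bytes consumed, 0 for a NUL
   character, (size_t)-1 for an invalid sequence, or (size_t)-2 for an
   incomplete one.  For bytes >= 0x80 in the "C" locale the result is
   implementation-dependent. */
static size_t first_char_length(const char *s)
{
    mbstate_t state;
    wchar_t wc;

    assert(s != NULL);
    memset(&state, 0, sizeof state);
    return mbrtowc(&wc, s, strlen(s), &state);
}
```

After setlocale(LC_ALL, "C"), charset_matches_mb_cur_max("UTF-8") is expected to report a mismatch on systems where MB_CUR_MAX is 1 in that locale, while charset_matches_mb_cur_max("US-ASCII") passes. first_char_length("R") is 1, but what first_char_length("\xC3\xA9") returns is left to the implementation.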