Re: GNU sed version 4.2.1: on OS X, C locale gets aliased to UTF-8

Bruno Haible Thu, 07 Jun 2012 05:06:07 -0700

Stephen Butler wrote:
> POSIX says that the "C" locale should treat text data as binary input,


Can you please point to where this is written?

IMO [1] describes the behaviour of the "C" locale only for characters
that belong to what we know as "US-ASCII" (i.e. bytes 0x00..0x7F).
As soon as you pass the string "Rémi" to a program running in the "C" locale,
you are speculating on implementation-dependent behaviour.
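
To make the boundary concrete, here is a minimal sketch in C of a check for
whether a string stays inside the US-ASCII range 0x00..0x7F. The function name
is_us_ascii() is hypothetical and only for illustration; the sketch also
assumes the string is passed as UTF-8 bytes, in which "é" is 0xC3 0xA9.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if every byte of the NUL-terminated
   string s lies in the US-ASCII range 0x00..0x7F. */
bool is_us_ascii(const char *s)
{
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++) {
        if (*p > 0x7F)
            return false;   /* byte outside US-ASCII */
    }
    return true;
}
```

With this helper, "Remi" passes, while "Rémi" encoded as UTF-8 does not,
because of the bytes 0xC3 and 0xA9.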

> But this is dangerous, because now UTF-8 is set but MB_CUR_MAX is 1
> and various parts of sed interpret "Rémi Leblond" as an invalid
> character sequence for a UTF-8 character set.

Indeed, I can see how this inconsistency leads to bugs like the ones
described.

The fix could be to have two different locale_charset() functions,
one that returns "US-ASCII" and another one that returns "UTF-8".
The first one would be used wherever MB_CUR_MAX and mbrtowc() are used
as well, the second one would be used by gettext(). But the dividing
line between the two cases is not yet clear to me. Any insights?
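
One possible reading of this idea, as a rough sketch only: the two function
names below are hypothetical, and the rule "report US-ASCII when MB_CUR_MAX
is 1 but the codeset claims UTF-8" is an assumption made for illustration,
not a settled design. The codeset and the MB_CUR_MAX value are passed in as
parameters so that the logic can be read without any locale setup.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical: charset to assume in code that also relies on MB_CUR_MAX
   and mbrtowc(). If the locale claims UTF-8 but is single-byte, fall back
   to "US-ASCII" so that both views agree (assumed rule). */
const char *charset_for_multibyte(const char *reported, size_t mb_cur_max)
{
    if (mb_cur_max == 1 && strcmp(reported, "UTF-8") == 0)
        return "US-ASCII";
    return reported;
}

/* Hypothetical: charset to hand to gettext(), which keeps the reported
   name unchanged. */
const char *charset_for_messages(const char *reported)
{
    return reported;
}
```

In a real program the reported name might come from locale_charset() or
nl_langinfo(CODESET), and the second argument from MB_CUR_MAX after
setlocale() has been called.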

Bruno

[1] http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html#tag_07_02
