Re: GNU sed version 4.2.1: on OS X, C locale gets aliased to UTF-8

Eric Blake Thu, 07 Jun 2012 05:50:39 -0700

On 06/07/2012 06:07 AM, Bruno Haible wrote:
> Stephen Butler wrote:
>> POSIX says that the "C" locale should treat text data as binary input,
> 
> Can you please point to where this is written?


POSIX doesn't explicitly say that the "C" locale treats text as binary,
although it does have several interfaces that are well defined across
_all_ bytes, even when those bytes fall outside the portable character
set (that is, outside the values 0-127).  For example:

http://pubs.opengroup.org/onlinepubs/9699919799/functions/strcasecmp.html
>> When the LC_CTYPE category of the current locale is from the POSIX locale, 
>> strcasecmp() and strncasecmp() shall behave as if the strings had been 
>> converted to lowercase and then a byte comparison performed. Otherwise, the 
>> results are unspecified.
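
To make that wording concrete, here is a minimal sketch of the behaviour
it describes: fold only 'A'..'Z' to lowercase, then compare bytes as
unsigned values.  The function names are hypothetical, and the sketch
assumes an ASCII-based execution character set; it is not how any
particular libc implements strcasecmp().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: lowercase only ASCII 'A'..'Z'; every other byte,
 * including bytes above 127, is returned unchanged. */
unsigned char fold_ascii_sketch(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c - 'A' + 'a') : c;
}

/* Sketch of "convert to lowercase, then byte comparison" for the
 * POSIX locale.  Assumes an ASCII-based character set. */
int strcasecmp_posix_sketch(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;

    /* Advance while both strings agree after folding. */
    while (*p != '\0' && fold_ascii_sketch(*p) == fold_ascii_sketch(*q)) {
        p++;
        q++;
    }
    /* Difference of the first mismatching bytes, as unsigned values. */
    return (int)fold_ascii_sketch(*p) - (int)fold_ascii_sketch(*q);
}
```

Note that bytes above 127 take part in the comparison as plain byte
values, so the result is well defined for any input.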

> 
> IMO [1] describes the behaviour of the "C" locale only for characters
> that belong to what we know as "US-ASCII" (i.e. bytes 0x00..0x7F).

You are correct - by stating that the "C" locale is only guaranteed to
support 0-127, POSIX is intentionally allowing a C locale to support
additional characters as an extension, and even permits a C locale with
a multibyte encoding.  (That is, I think that a C locale written with
UTF-8 character encoding and MB_CUR_MAX>1 would comply with POSIX,
because a strictly portable program never passes bytes above 127 to
interfaces that are not defined to operate on all bytes.)  However,
experience with cygwin shows the practical cost: when cygwin 1.7.0
attempted to make 'C' use UTF-8 by default, it flushed out enough
programs that were unprepared to deal with MB_CUR_MAX > 1 in the C
locale that current cygwin now uses 'C' with MB_CUR_MAX==1 and
'C.UTF-8' for multibyte handling.  (Note that most of the problems
could probably be traced to bugs in programs that were making
non-standard assumptions, but it was easier for cygwin to comply with
those assumptions than it was to propose compliance patches to all the
affected programs.)
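
For anyone who wants to see what a given platform does, a small probe
along these lines may help.  It is only a sketch: the function name is
hypothetical, it leaves LC_CTYPE switched to the probed locale, and the
value it reports for "C" is exactly the implementation-dependent part
under discussion.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>

/* Hypothetical probe: switch LC_CTYPE to locale_name and report
 * MB_CUR_MAX there.  Returns 0 if the locale is not available.
 * Side effect: LC_CTYPE stays set to locale_name on success. */
size_t mb_cur_max_in_sketch(const char *locale_name)
{
    assert(locale_name != NULL);
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return 0;               /* locale not installed on this system */
    return MB_CUR_MAX;          /* evaluated under the new locale */
}
```

Calling mb_cur_max_in_sketch("C") and, where it exists,
mb_cur_max_in_sketch("C.UTF-8") shows whether a platform separates the
two the way cygwin does.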

> As soon as you pass the string "Rémi" to a program running in the "C" locale,
> you are speculating on implementation-dependent behaviour.

Very much true.
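
One way to observe that dependence is to run the bytes through
mbrtowc() and count what the current locale rejects.  The helper below
is a rough sketch with a hypothetical name.  Pure ASCII input should
always give 0, while the count for the UTF-8 or Latin-1 bytes of "Rémi"
in the "C" locale will differ between implementations.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the bytes of s that mbrtowc() refuses to
 * decode in the current LC_CTYPE locale. */
size_t count_undecodable_bytes_sketch(const char *s)
{
    assert(s != NULL);
    mbstate_t st;
    size_t left = strlen(s);
    size_t bad = 0;

    memset(&st, 0, sizeof st);          /* initial conversion state */
    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, left, &st);

        if (n == (size_t)-1 || n == (size_t)-2) {
            /* Invalid or truncated sequence: count one byte,
             * skip it, and start over from the initial state. */
            bad++;
            s++;
            left--;
            memset(&st, 0, sizeof st);
        } else {
            if (n == 0)
                n = 1;                  /* defensive; strlen() excludes NUL */
            s += n;
            left -= n;
        }
    }
    return bad;
}
```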

> 
>> But this is dangerous, because now UTF-8 is set but MB_CUR_MAX is 1
>> and various parts of sed interpret "Rémi Leblond" as an invalid
>> character sequence for a UTF-8 character set.
> 
> Indeed, I can see how this inconsistency leads to bugs like the described
> ones.
> 
> The fix could be to have two different locale_charset() functions,
> one that returns "US-ASCII" and another one that returns "UTF-8".
> The first one to be used when MB_CUR_MAX and mbrtowc() are used as
> well, the second one to be used by gettext(). But the separation
> line between the two cases is not yet clear to me. Any insights?

On OS X, can we wrap MB_CUR_MAX to pretend to be 1 when in the "C"
locale, to match what cygwin did in distinguishing between 'C' and
'C.UTF-8'?
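
Roughly, I am picturing something like the sketch below.  The function
name is made up for illustration and is not an existing gnulib or sed
interface.  It also assumes that setlocale(LC_CTYPE, NULL) reports the
name "C" or "POSIX" when that locale is in effect, which would need
checking on OS X.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical replacement for direct uses of MB_CUR_MAX: report 1
 * whenever LC_CTYPE is the "C" (or "POSIX") locale, whatever the
 * platform claims, and defer to the real value otherwise. */
size_t mb_cur_max_sketch(void)
{
    /* A NULL second argument only queries the current locale name. */
    const char *name = setlocale(LC_CTYPE, NULL);
    size_t result;

    if (name != NULL
        && (strcmp(name, "C") == 0 || strcmp(name, "POSIX") == 0))
        result = 1;             /* pretend the C locale is single-byte */
    else
        result = MB_CUR_MAX;    /* trust the platform elsewhere */

    assert(result >= 1);
    return result;
}
```

Callers that currently test MB_CUR_MAX > 1 would test
mb_cur_max_sketch() > 1 instead; whether mbrtowc() needs a matching
wrapper is a separate question.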

> 
> Bruno
> 
> [1] 
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html#tag_07_02
> 
> 
> 
> 

-- 
Eric Blake   ebl...@redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
