On 06/07/2012 06:07 AM, Bruno Haible wrote:
> Stephen Butler wrote:
>> POSIX says that the "C" locale should treat text data is binary input,
>
> Can you please point to where this is written?

POSIX doesn't explicitly say that the "C" locale treats text as binary,
although it does have several interfaces that are well defined across
_all_ bytes, even when those bytes fall outside the portable character
set (byte values 0-127).  For example:

http://pubs.opengroup.org/onlinepubs/9699919799/functions/strcasecmp.html

>> When the LC_CTYPE category of the current locale is from the POSIX locale,
>> strcasecmp() and strncasecmp() shall behave as if the strings had been
>> converted to lowercase and then a byte comparison performed. Otherwise, the
>> results are unspecified.
>
> IMO [1] describes the behaviour of the "C" locale only for characters
> that belong to what we know as "US-ASCII" (i.e. bytes 0x00..0x7F).

You are correct - by stating that the "C" locale is only guaranteed to
support 0-127, POSIX is intentionally allowing a C locale to support
additional characters as an extension, and even permits a C locale with
multibyte encoding (that is, I think that a C locale written with UTF-8
character encoding and MB_CUR_MAX>1 would comply with POSIX, because the
only portable actions on characters will not pass any 8-bit bytes to
interfaces not defined to operate on all bytes).

However, experience with cygwin shows the cost of doing so: when cygwin
1.7.0 attempted to make 'C' use UTF-8 by default, it flushed out so many
programs that were unprepared to deal with MB_CUR_MAX > 1 in the C locale
that current cygwin now uses 'C' with MB_CUR_MAX==1 and 'C.UTF-8' for
multibyte handling.  (Note that most of the problems could probably be
traced to bugs in programs that were making non-standard assumptions,
but it was easier for cygwin to comply with those assumptions than it
was to propose compliance patches to all the affected programs.)

> As soon as you pass the string "Rémi" to a program running in the "C" locale,
> you are speculating on implementation-dependent behaviour.

Very much true.

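To make the strcasecmp() wording quoted above concrete, here is a minimal
sketch that spells out "converted to lowercase and then a byte comparison
performed" as a reference function and checks that strcasecmp() agrees with
it in sign once the "C" locale is selected.  The names ref_casecmp(), sign()
and agrees_in_c_locale() are made up for this illustration; they are not part
of POSIX, gnulib or sed.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stddef.h>
#include <strings.h>

/* Hypothetical reference: lowercase each byte with tolower(), then compare
   the results as unsigned bytes.  In the "C" locale tolower() only changes
   'A'..'Z', so bytes above 0x7F are compared unchanged.  */
static int ref_casecmp(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;
    for (;; p++, q++) {
        int ca = tolower(*p);
        int cb = tolower(*q);
        if (ca != cb || ca == '\0')
            return ca - cb;
    }
}

/* Reduce a comparison result to -1, 0 or 1 so that only signs are compared. */
static int sign(int v)
{
    return (v > 0) - (v < 0);
}

/* Hypothetical check: switch to the "C" locale and report whether
   strcasecmp() and the reference agree in sign for this pair.  */
static int agrees_in_c_locale(const char *a, const char *b)
{
    if (setlocale(LC_ALL, "C") == NULL)
        return 0;
    return sign(strcasecmp(a, b)) == sign(ref_casecmp(a, b));
}
```

Only the sign is compared because POSIX specifies the sign of the result,
not its magnitude.  Outside the POSIX locale the quoted text leaves the
result unspecified, so the check says nothing about other locales.
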
>
>> But this is dangerous, because now UTF-8 is set but MB_CUR_MAX is 1
>> and various parts of sed interpret "Rémi Leblond" as an invalid
>> character sequence for a UTF-8 character set.
>
> Indeed, I can see how this inconsistency leads to bugs like the described
> ones.
>
> The fix could be to have two different locale_charset() functions,
> one that returns "US-ASCII" and another one that returns "UTF-8".
> The first one to be used when MB_CUR_MAX and mbrtowc() are used as
> well, the second one to be used by gettext(). But the separation
> line between the two cases is not yet clear to me. Any insights?

On OS X, can we wrap MB_CUR_MAX to pretend to be 1 when in the "C"
locale, to match what cygwin did in distinguishing between 'C' and
'C.UTF-8'?

>
> Bruno
>
> [1]
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html#tag_07_02

--
Eric Blake   ebl...@redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
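
A rough sketch of the MB_CUR_MAX wrapping idea floated above, assuming the
"C" locale can be recognized by the name that setlocale(LC_CTYPE, NULL)
reports.  The helpers is_posix_locale_name() and effective_mb_cur_max() are
hypothetical names for illustration only; this is not how gnulib or sed
actually handle it.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: true if NAME denotes the POSIX locale.  */
static int is_posix_locale_name(const char *name)
{
    assert(name != NULL);
    return strcmp(name, "C") == 0 || strcmp(name, "POSIX") == 0;
}

/* Hypothetical wrapper: pretend MB_CUR_MAX is 1 while LC_CTYPE is the
   "C"/"POSIX" locale, and return the real MB_CUR_MAX otherwise.  */
static size_t effective_mb_cur_max(void)
{
    /* A NULL second argument only queries the locale, it does not change it. */
    const char *name = setlocale(LC_CTYPE, NULL);
    if (name != NULL && is_posix_locale_name(name))
        return 1;
    return MB_CUR_MAX;
}
```

Callers would use effective_mb_cur_max() wherever they currently test
MB_CUR_MAX to choose between single-byte and multibyte code paths.  A name
such as "C.UTF-8" does not match and still gets the platform's real value,
which mirrors the 'C' versus 'C.UTF-8' distinction cygwin draws.
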