Paul Eggert reported:
> <http://lists.gnu.org/archive/html/bug-bison/2012-01/msg00107.html>.

Akim Demaille wrote:
> I'm sending this message to you as the main author of
> the quotearg module.  I am not sure which component should
> be considered guilty here, but the problem is:
>
> - independently of any LC_*, localcharset.c returns UTF-8
>   on OS X.
>
> - If I instrument localcharset.c, I can see that the OS
>   returns "US-ASCII" as locale_codeset.
>
> - localcharset's get_charset_aliases then maps US-ASCII
>   to UTF-8 ...
>
> - so quotearg decides to use nice UTF-8 quotes (since
>   quote.c asks for locale-dependent quotes).
>
> - so the test suite fails since it expects plain old "'".
>
> What module would be considered faulty here?

The test suite is faulty. Rationale:

  - The localcharset.c code is meant to return the character encoding
    in the current locale. Pretty much like nl_langinfo(CODESET), except
    that the latter is botched on many systems: on some it returns
    non-standard encoding names such as "646", on some an empty string,
    and on some (such as Cygwin or MacOS X) it returns "US-ASCII" when
    in reality the character encoding is different.
    localcharset.c can be seen as an override of nl_langinfo (CODESET),
    except that it does not (yet) have the form of a gnulib-style
    override.

  - POSIX [1] does not specify the character encoding of the "C" locale.
    It could be US-ASCII or any extension of it, such as ISO-8859-1 or
    UTF-8.

  - On MacOS X the Terminal.app's encoding and the general text encoding
    are UTF-8.

  - On MacOS X nearly all users are working in the "C" locale. If a user
    has told the OS that he's working in the French locale, the OS does
    not set LC_* variables to indicate this, nor does the user usually
    do so (why should he? he has already specified it once). Therefore
    the normal situation on MacOS X is this:

      $ env | grep LC_
      $ locale
      LANG=
      LC_COLLATE="C"
      LC_CTYPE="C"
      LC_MESSAGES="C"
      LC_MONETARY="C"
      LC_NUMERIC="C"
      LC_TIME="C"
      LC_ALL=

  - gettext() takes care to transliterate messages to the locale encoding.
    If locale_charset() is "UTF-8", 'rm --help' will show for a French user
      Usage: rm [OPTION]... FICHIER...
      Supprime (défait le lien) les FILE(s).
      ...
    and for a Chinese user
      用法:rm [选项]... 文件...
      Remove (unlink) the FILE(s).
    If locale_charset() is "US-ASCII", 'rm --help' will show instead:
      Usage: rm [OPTION]... FICHIER...
      Supprime (d'efait le lien) les FILE(s).
    and for a Chinese user no translation at all:
      Usage: rm [OPTION]... FILE...
      Remove (unlink) the FILE(s).

  - quotearg's use of gettext() and locale_charset() to determine whether
    to use ‘...’ instead of '...' is entirely appropriate, because
      1. In situations where gettext() is known to make use of non-ASCII
         characters in its resulting strings, it is also OK for quotearg
         to make use of such characters.
      2. quotearg is not used in places where POSIX demands a certain
         result in the "C" locale.

In <http://lists.gnu.org/archive/html/bug-bison/2012-01/msg00091.html>
Akim also wrote:
> I had never realized that the tests are not specifying LC_ALL=C
> and they should.  But even when I do, I still have nice quotes.

Indeed there is a slight difference in behaviour between gettext() and
locale_charset(): Setting the environment variable LC_ALL=C disables all
translations in gettext() - this is needed so that some coreutils
programs can be POSIX compliant -, whereas locale_charset() doesn't have
this special code.

There are several systems with locale encoding UTF-8 in all user
locales: Plan 9, BeOS, Haiku, MacOS X, Cygwin 1.7, and there will be
more, because it's a natural choice nowadays. In such environments, it
makes less and less sense to assign the US-ASCII encoding to the "C"
locale.

US-ASCII encoding was a good choice for the "C" locale from 1996 to
2001, as a transition between the ISO-8859-1 world and the UTF-8 world.
It isn't any more.

Let's fix the testsuites.

Paul Eggert wrote:
> Does the following gnulib patch fix things for Bison on OS X?
> I'll CC: this to bug-gnulib@gnu.org, to give Bruno Haible
> a heads-up about the localcharset problem.
>
> localcharset: port to Mac OS X's C locale
> * lib/localcharset.c (get_charset_aliases) [DARWIN7]:
> Map "US-ASCII" to "ASCII".  Problem reported by Akim Demaille in
> diff --git a/lib/localcharset.c b/lib/localcharset.c
> index d86002c..68ccf60 100644
> --- a/lib/localcharset.c
> +++ b/lib/localcharset.c
> @@ -262,6 +262,7 @@ get_charset_aliases (void)
>             "ISO8859-9" "\0" "ISO-8859-9" "\0"
>             "ISO8859-13" "\0" "ISO-8859-13" "\0"
>             "ISO8859-15" "\0" "ISO-8859-15" "\0"
> +           "US-ASCII" "\0" "ASCII" "\0"
>             "KOI8-R" "\0" "KOI8-R" "\0"
>             "KOI8-U" "\0" "KOI8-U" "\0"
>             "CP866" "\0" "CP866" "\0"

Nah. "Let's break gettext() based internationalization of all GNU
programs for most MacOS X users" won't get my approval.

Bruno

[1] http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html
    section 7.2
[2] http://pubs.opengroup.org/onlinepubs/9699919799/utilities/df.html
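
For reference, a minimal sketch of how an alias table laid out like the
one in the quoted patch can be searched. It assumes the table consists of
alternating NUL-terminated alias and canonical names, closed by an empty
string. demo_aliases and resolve_alias are hypothetical names used for
illustration, not gnulib's identifiers, and the US-ASCII entry only
mirrors the mapping Akim observed on OS X.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table: alias, canonical name, alias, canonical name, ...
   The implicit NUL that ends the string literal acts as the closing
   empty string.  The US-ASCII -> UTF-8 pair is an assumption taken from
   the behaviour described in the thread, not copied from localcharset.c.  */
const char demo_aliases[] =
  "ISO8859-9"  "\0" "ISO-8859-9"  "\0"
  "ISO8859-15" "\0" "ISO-8859-15" "\0"
  "US-ASCII"   "\0" "UTF-8"       "\0";

/* Return the canonical name paired with CODESET in TABLE, or CODESET
   itself when no alias matches.  */
const char *
resolve_alias (const char *table, const char *codeset)
{
  const char *p;

  assert (table != NULL && codeset != NULL);
  for (p = table; *p != '\0'; )
    {
      /* The canonical name starts right after the alias's NUL.  */
      const char *canonical = p + strlen (p) + 1;
      if (strcmp (codeset, p) == 0)
        return canonical;
      /* Skip the canonical name to reach the next alias.  */
      p = canonical + strlen (canonical) + 1;
    }
  return codeset;
}
```

With this table, resolve_alias (demo_aliases, "US-ASCII") yields "UTF-8".
Changing that one canonical name to "ASCII", as the proposed patch does,
would change the answer for every caller of the lookup, which per the
reply above includes gettext().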