>> Can someone remind me why the build system sets the C locale in the
>> first place?
>
> Answering my own question: I guess it is because we can't assume the
> system has an en_US.UTF-8 locale or a C.UTF-8 locale?

Exactly. Some time ago (see

  https://lists.gnu.org/archive/html/lilypond-devel/2022-11/msg00247.html

and the following mails in the thread) I worked on fixing exactly that
(see attached old patch), but I stopped since we could then – at least
partially – circumvent the related problems. I also stopped working
on a `locale-c-utf8.m4` file (also attached) to recognize the 'C.UTF-8'
and/or 'en_US.UTF-8' locales, which turned out to be a really hard
problem.

> Maybe we ought to set only LC_MESSAGES to C? It should prevent
> overriding the system locale for file names but still make logs
> output in English.

Honestly, I don't know. The whole locale stuff is such a mess.


    Werner
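
For illustration only, here is a minimal sketch of the `LC_MESSAGES`-only idea quoted above, written in the style of the `run_env` code in the attached patch. The helper name `english_messages_env` is hypothetical and not part of LilyPond. It relies on two general facts: `LC_ALL` takes precedence over `LC_MESSAGES`, and GNU gettext consults `LANGUAGE` before `LC_MESSAGES`. Whether this is sufficient for tools like `texindex` is untested.

```python
import os


def english_messages_env(base_env=None):
    """Return an environment copy in which only messages are forced to 'C'.

    Hypothetical helper for illustration; not part of the LilyPond sources.
    """
    env = dict(os.environ if base_env is None else base_env)

    # LC_ALL would override LC_MESSAGES, so it cannot stay set.  As an
    # approximation its value is moved to LANG; individually set LC_*
    # variables then take effect again.
    lc_all = env.pop('LC_ALL', None)
    if lc_all:
        env['LANG'] = lc_all

    # Untranslated (English) messages, everything else as the user set it.
    env['LC_MESSAGES'] = 'C'
    # GNU gettext gives LANGUAGE priority over LC_MESSAGES; clear it.
    env['LANGUAGE'] = ''
    return env


if __name__ == '__main__':
    env = english_messages_env({'LANG': 'de_DE.UTF-8', 'LANGUAGE': 'de'})
    print(env['LANG'], env['LC_MESSAGES'], repr(env['LANGUAGE']))
```

Such an environment could be passed as the `env` argument of `subprocess.run`, so that file-name handling keeps the user's encoding while log output stays in English.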
From 231f5c2fba0fccf333fc7873da771fc8493b71e2 Mon Sep 17 00:00:00 2001
From: Werner Lemberg <w...@gnu.org>
Date: Mon, 7 Nov 2022 09:38:17 +0100
Subject: [PATCH 1/2] generic-vars.make: Fix locale for build process
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* In non-English indices of the PDF documentation, this fixes the
  initials for entries starting with a non-ASCII character. This is
  a generic problem with the `texindex` awk script (used by the
  Texinfo infrastructure to sort indices).

    https://lists.gnu.org/archive/html/bug-texinfo/2022-11/msg00008.html

* Some non-ASCII characters are now correctly displayed in info, PDF,
  and HTML output files (example: `rotated 45?` → `rotated 45°` in the
  documentation of the `\rotate` markup).

Also update documentation accordingly.
---
 Documentation/en/contributor/lsr-work.itexi | 15 ++++++++-----
 Documentation/en/usage/running.itely        |  2 +-
 Documentation/es/usage/running.itely        |  3 ++-
 Documentation/fr/usage/running.itely        |  2 +-
 Documentation/it/usage/running.itely        |  2 +-
 Documentation/ja/usage/running.itely        |  2 +-
 make/generic-vars.make                      | 25 ++++++++++++++++++---
 python/book_latex.py                        |  2 +-
 python/book_texinfo.py                      |  2 +-
 9 files changed, 39 insertions(+), 16 deletions(-)

diff --git a/Documentation/en/contributor/lsr-work.itexi b/Documentation/en/contributor/lsr-work.itexi
index b03ebf8122..c69d8ca485 100644
--- a/Documentation/en/contributor/lsr-work.itexi
+++ b/Documentation/en/contributor/lsr-work.itexi
@@ -445,12 +445,15 @@ updating the binary running the LSR.

 @item
 Download the latest snippet tarball from
-@uref{https://lsr.di.unimi.it/download/} and extract it.
-The relevant files can be found in the @file{all} subdirectory.
-Make sure your shell is using an English language version, for
-example @code{LANG=en_US}, then run @command{convert-ly} on all
-the files. Use the command-line option @code{--to=version} to
-ensure the snippets are updated to the correct stable version.
+@uref{https://lsr.di.unimi.it/download/} and extract it; the
+relevant files can be found in the @file{all} subdirectory.
+
+Now run @code{LANG=C.UTF-8 convert-ly} on all the files. The
+@code{LANG=C.UTF-8} part sets the @code{LANG} environment variable
+to a locale with UTF-8 encoding and disabled translations while
+@command{convert-ly} gets executed. Also use the command-line
+option @code{--to=@var{version}} to ensure the snippets are
+updated to the correct stable version.

 @item
 Make sure that you are using @command{convert-ly} from the latest
diff --git a/Documentation/en/usage/running.itely b/Documentation/en/usage/running.itely
index a24c7ac392..029d1ae446 100644
--- a/Documentation/en/usage/running.itely
+++ b/Documentation/en/usage/running.itely
@@ -925,7 +925,7 @@ overrides the value derived from the location of the
 @item LANG
 The language for LilyPond data sent to @code{stdout} and
 @code{stderr}, for example progress reports, warning messages, or
-debug output. Example: @code{LANG=de}.
+debug output. Example: @samp{LANG=de_DE.UTF-8}.

 @item LILYPOND_LOGLEVEL
 The default loglevel. If LilyPond is called without an explicit
diff --git a/Documentation/es/usage/running.itely b/Documentation/es/usage/running.itely
index 7f2629cd7d..7c722c50af 100644
--- a/Documentation/es/usage/running.itely
+++ b/Documentation/es/usage/running.itely
@@ -977,7 +977,8 @@ ubicación del binario @command{lilypond}.
 @item LANG
 Idioma de los datos de LilyPond enviados a @code{stdout} y a
 @code{stderr}, por ejemplo informes del avance, mensajes de
-advertencia o salida de depuración. Ejemplo: @code{LANG=es}.
+advertencia o salida de depuración. Ejemplo:
+@samp{LANG=es_ES.UTF-8}.

 @item LILYPOND_LOGLEVEL
 Nivel de registro predeterminado. Si LilyPond se llama sin ningún
diff --git a/Documentation/fr/usage/running.itely b/Documentation/fr/usage/running.itely
index 5295ab0638..752ace6a2a 100644
--- a/Documentation/fr/usage/running.itely
+++ b/Documentation/fr/usage/running.itely
@@ -982,7 +982,7 @@ partir d'où réside le binaire @command{lilypond}.
 Cette variable détermine la langue dans laquelle sont émises les données
 sur @code{stdout} (sortie standard) et @code{stderr} (sortie des erreurs),
 pour afficher la progession, les avertissements ou messages de
-débogage. Par exemple : @code{LANG=de}.
+débogage. Par exemple : @samp{LANG=fr_FR.UTF-8}.

 @item LILYPOND_LOGLEVEL
 Cette variable détermine le niveau par défaut de verbosité. En
diff --git a/Documentation/it/usage/running.itely b/Documentation/it/usage/running.itely
index 6e8429660c..4b30279047 100644
--- a/Documentation/it/usage/running.itely
+++ b/Documentation/it/usage/running.itely
@@ -974,7 +974,7 @@ binario @command{lilypond}.
 @item LANG
 La lingua per i dati LilyPond inviati a @code{stdout} e @code{stderr}, per
 esempio relazioni sullo stato di avanzamento, messaggi di avviso o informazioni
-di debug. Esempio: @code{LANG=de}.
+di debug. Esempio: @samp{LANG=it_IT.UTF-8}.

 @item LILYPOND_LOGLEVEL
 Il livello di log (loglevel) predefinito. Se LilyPond viene chiamato senza un
diff --git a/Documentation/ja/usage/running.itely b/Documentation/ja/usage/running.itely
index 87201b2141..5c1d0499a1 100644
--- a/Documentation/ja/usage/running.itely
+++ b/Documentation/ja/usage/running.itely
@@ -958,7 +958,7 @@ SVG ビュアーが対応していないことがあるので、@c
 @item LANG
 @code{stdout} および @code{stderr} に送信される LilyPond データ、@c
 たとえば、進捗レポート、警告メッセージ、デバッグ出力などの言語を@c
-選択します。例: @code{LANG=de}
+選択します。例: @code{LANG=ja_JP.UTF-8}

 @item LILYPOND_LOGLEVEL
 デフォルトのログレベル。@c
diff --git a/make/generic-vars.make b/make/generic-vars.make
index 481e29c420..d62d9ddb28 100644
--- a/make/generic-vars.make
+++ b/make/generic-vars.make
@@ -60,10 +60,29 @@ TOPLEVEL_VERSION=$(TOPLEVEL_MAJOR_VERSION).$(TOPLEVEL_MINOR_VERSION).$(TOPLEVEL_
 endif


-# no locale settings in the build process.
-LANG=C
+# The locale settings in the build process.
+
+# Note that `LANG=C` doesn't work correctly: without a UTF-8 locale,
+# some programs like `texindex` (used by Texinfo to sort indices) emit
+# single-byte output under certain conditions, i.e., the created files
+# can contain bytes that are not valid UTF-8.
+LANG=C.UTF-8
 export LANG

+# Contrary to `LANG=C`, the `LANGUAGE` environment variable is *not*
+# ignored for `LANG=C.UTF-8`, at least not on all platforms that use
+# the 'glibc' library: as of November 2022, only the forthcoming
+# 'glibc' version 2.36 will correctly ignore `LANGUAGE` for the
+# 'C.UTF-8' locale.
+#
+#   https://sourceware.org/bugzilla/show_bug.cgi?id=29777
+#
+# We have thus to enforce the default language, otherwise
+# `ly:generate-warning` in regression fails because messages get
+# translated.
+LANGUAGE=
+export LANGUAGE
+
 # texi2html iterates over section headers stored as entries of a map.
 # Disable Perl's hash randomization to make the order reproducible.
@@ -108,7 +127,7 @@ script-dir = $(src-depth)/scripts
 export PYTHONPATH:=$(auxpython-dir):$(PYTHONPATH)

 MAKEINFO_FLAGS += --enable-encoding --error-limit=0
-MAKEINFO = LANG=C $(MAKEINFO_PROGRAM) $(MAKEINFO_FLAGS)
+MAKEINFO = LANG=C.UTF-8 LANGUAGE= $(MAKEINFO_PROGRAM) $(MAKEINFO_FLAGS)

 # texi2html v5 has fatal errors in the build, so only be strict about
 # errors in the version we officially support
diff --git a/python/book_latex.py b/python/book_latex.py
index ab8b2c7d3e..806f58cf6b 100644
--- a/python/book_latex.py
+++ b/python/book_latex.py
@@ -214,7 +214,7 @@ def get_latex_textwidth(source, global_options):
     cmd = '%s %s' % (global_options.latex_program, tmpfile)
     ly.debug_output("Executing: %s\n" % cmd)
     run_env = os.environ.copy()
-    run_env['LC_ALL'] = 'C'
+    run_env['LC_ALL'] = 'C.UTF-8'
     run_env['TEXINPUTS'] = os.path.pathsep.join(
         (global_options.input_dir,
          run_env.get('TEXINPUTS', '')))
diff --git a/python/book_texinfo.py b/python/book_texinfo.py
index d1e429f6cc..c44f23560e 100644
--- a/python/book_texinfo.py
+++ b/python/book_texinfo.py
@@ -232,7 +232,7 @@ def get_texinfo_width_indent(source, global_options):
         global_options.texinfo_program, outfile, tmpfile)
     ly.debug_output("Executing: %s\n" % cmd)
     run_env = os.environ.copy()
-    run_env['LC_ALL'] = 'C'
+    run_env['LC_ALL'] = 'C.UTF-8'

     # unknown why this is necessary
     universal_newlines = True
-- 
2.38.1
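
The failure mode named in the `generic-vars.make` comment of the patch (generated files containing bytes that are not valid UTF-8) can be checked after the fact. The following is only a small sketch; the function name `is_valid_utf8` is hypothetical, and the second sample assumes the degree sign was emitted as the single Latin-1 byte 0xB0, which is one plausible form of the `rotated 45?` breakage.

```python
def is_valid_utf8(data):
    """Return True if the byte string `data` decodes cleanly as UTF-8."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


if __name__ == '__main__':
    # Degree sign encoded as UTF-8 (two bytes, 0xC2 0xB0).
    print(is_valid_utf8('rotated 45\u00b0'.encode('utf-8')))
    # Degree sign as a lone single byte 0xB0, which is not valid UTF-8.
    print(is_valid_utf8(b'rotated 45\xb0'))
```

The first call prints `True`, the second `False`.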
dnl locale-c-utf8.m4  -*-shell-script-*-
dnl Copyright (C) 2003, 2005-2018, 2022 Free Software Foundation, Inc.
dnl
dnl This file is free software; the Free Software Foundation
dnl gives unlimited permission to copy and/or distribute it,
dnl with or without modifications, as long as this notice is preserved.

dnl From Bruno Haible and Werner Lemberg.

dnl Find a 'C.UTF-8' locale encoding.
dnl This file is based on `locale-de.m4` from 'gnulib'.

AC_DEFUN([LOCALE_C_UTF8],
[
  AC_REQUIRE([AM_LANGINFO_CODESET])
  AC_CACHE_CHECK([for a 'C.UTF-8' locale], [ac_cv_locale_c_utf8], [
    AC_LANG_CONFTEST([AC_LANG_SOURCE([[
#include <locale.h>
#include <time.h>
#if HAVE_LANGINFO_CODESET
# include <langinfo.h>
#endif
#include <stdlib.h>
#include <string.h>
struct tm t;
char buf[16];
int main () {
  /* On BeOS and Haiku, locales are not implemented in libc. Rather, libintl
     imitates locale dependent behaviour by looking at the environment
     variables, and all locales use the UTF-8 encoding. */
#if !(defined __BEOS__ || defined __HAIKU__)
  /* Check whether the given locale name is recognized by the system. */
# if defined _WIN32 && !defined __CYGWIN__
  /* On native Windows, setlocale(category, "") looks at the system settings,
     not at the environment variables. Also, when an encoding suffix such
     as ".65001" or ".54936" is specified, it succeeds but sets the LC_CTYPE
     category of the locale to "C". */
  if (setlocale (LC_ALL, getenv ("LC_ALL")) == NULL
      || strcmp (setlocale (LC_CTYPE, NULL), "C") == 0)
    return 1;
# else
  if (setlocale (LC_ALL, "") == NULL) return 1;
# endif
  /* Check whether nl_langinfo(CODESET) is nonempty and not "ASCII" or "646".
     On Mac OS X 10.3.5 (Darwin 7.5) in the de_DE locale, nl_langinfo(CODESET)
     is empty, and the behaviour of Tcl 8.4 in this locale is not useful.
     On OpenBSD 4.0, when an unsupported locale is specified, setlocale()
     succeeds but then nl_langinfo(CODESET) is "646". In this situation,
     some unit tests fail. */
# if 0 && HAVE_LANGINFO_CODESET
  /* XXX: How shall this look like for 'C.utf8' or 'en_US.UTF-8'? */
  {
    const char *cs = nl_langinfo (CODESET);
    if (cs[0] == '\0' || strcmp (cs, "ASCII") == 0
        || strcmp (cs, "646") == 0)
      return 1;
  }
# endif
# ifdef __CYGWIN__
  /* On Cygwin, avoid locale names without encoding suffix, because the
     locale_charset() function relies on the encoding suffix. Note that
     LC_ALL is set on the command line. */
  if (strchr (getenv ("LC_ALL"), '.') == NULL)
    return 1;
# endif
  /* XXX How can I test that UTF-8 encoding actually works? */
#endif
  return 0;
}
      ]])])
    if AC_TRY_EVAL([ac_link]) && test -s conftest$ac_exeext; then
      case "$host_os" in
        # Handle native Windows specially, because there setlocale() interprets
        # "ar" as "Arabic" or "Arabic_Saudi Arabia.1256",
        # "fr" or "fra" as "French" or "French_France.1252",
        # "ge"(!) or "deu"(!) as "German" or "German_Germany.1252",
        # "ja" as "Japanese" or "Japanese_Japan.932",
        # and similar.
        mingw*)
          if (LC_ALL=.65001 \
              LC_TIME= \
              LC_CTYPE= \
              ./conftest; exit) 2>/dev/null; then
            ac_cv_locale_c_utf8=.65001
          # Test for the hypothetical native Windows locale name.
          # XXX Shouldn't this be rather 'English_US.65001'?
          elif (LC_ALL="English_United States.65001" \
                LC_TIME= \
                LC_CTYPE= \
                ./conftest; exit) 2>/dev/null; then
            ac_cv_locale_c_utf8="English_United States.65001"
          else
            # None found.
            ac_cv_locale_c_utf8=none
          fi
          ;;
        *)
          if (LC_ALL=C \
              ./conftest; exit) 2>/dev/null; then
            ac_cv_locale_c_utf8=C
          # Setting LC_ALL is not enough. Need to set LC_TIME to empty, because
          # otherwise on Mac OS X 10.3.5 the LC_TIME=C from the beginning of the
          # configure script would override the LC_ALL setting. Likewise for
          # LC_CTYPE, which is also set at the beginning of the configure script.
          # Test for the usual locale name.
          elif (LC_ALL=en_US \
                LC_TIME= \
                LC_CTYPE= \
                ./conftest; exit) 2>/dev/null; then
            ac_cv_locale_c_utf8=en_US
          else
            # Test for the locale name with explicit encoding suffix.
            if (LC_ALL=C.UTF-8 \
                LC_TIME= \
                LC_CTYPE= \
                ./conftest; exit) 2>/dev/null; then
              ac_cv_locale_c_utf8=C.UTF-8
            elif (LC_ALL=en_US.UTF-8 \
                  LC_TIME= \
                  LC_CTYPE= \
                  ./conftest; exit) 2>/dev/null; then
              ac_cv_locale_c_utf8=en_US.UTF-8
            else
              # Test for the Solaris 7 locale name.
              if (LC_ALL=en.UTF-8 \
                  LC_TIME= \
                  LC_CTYPE= \
                  ./conftest; exit) 2>/dev/null; then
                ac_cv_locale_c_utf8=en.UTF-8
              else
                # None found.
                ac_cv_locale_c_utf8=none
              fi
            fi
          fi
          ;;
      esac
    fi
    rm -fr conftest*
  ])
  LOCALE_C_UTF8=$ac_cv_locale_c_utf8
  AC_SUBST([LOCALE_C_UTF8])
])
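
As a rough runtime counterpart to the probing done in the m4 file, the same idea can be sketched in Python: try each candidate name with `setlocale` and accept it only if `nl_langinfo(CODESET)` reports UTF-8 (the check that is disabled with `# if 0` above). This is an illustration under assumptions, not a replacement for a configure-time test: the function name `find_utf8_locale` is hypothetical, `locale.nl_langinfo` exists only on POSIX-like systems, and some C libraries accept arbitrary locale names, so a positive result is not proof that the locale is fully usable.

```python
import locale


def find_utf8_locale(candidates=('C.UTF-8', 'en_US.UTF-8', 'en.UTF-8')):
    """Return the first candidate locale that can be set and uses UTF-8.

    Returns None if no candidate qualifies.  The process locale is
    restored before returning.  Hypothetical helper for illustration.
    """
    saved = locale.setlocale(locale.LC_ALL)  # query the current setting
    try:
        for name in candidates:
            try:
                locale.setlocale(locale.LC_ALL, name)
            except locale.Error:
                continue  # locale name not known to the system
            codeset = locale.nl_langinfo(locale.CODESET)
            # Accept spellings such as 'UTF-8' and 'utf8'.
            if codeset.upper().replace('-', '') == 'UTF8':
                return name
        return None
    finally:
        locale.setlocale(locale.LC_ALL, saved)


if __name__ == '__main__':
    print(find_utf8_locale())
```

The candidate order mirrors the explicit-suffix tests in the m4 file ('C.UTF-8', then 'en_US.UTF-8', then 'en.UTF-8'); the result depends on the locales installed on the machine.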