>> Can someone remind me why the build system sets the C locale in the
>> first place?
> 
> Answering my own question: I guess it is because we can't assume the
> system has an en_US.UTF-8 locale or a C.UTF-8 locale?

Exactly.  Some time ago (see
https://lists.gnu.org/archive/html/lilypond-devel/2022-11/msg00247.html
and the following mails in the thread) I worked on fixing exactly that
(see attached old patch), but I stopped since we could then – at least
partially – circumvent the related problems.

I also stopped working on a `locale-c-utf8.m4` file (also attached) to
recognize the 'C.UTF-8' and/or 'en_US.UTF-8' locales, which turned out
to be a really hard problem.
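
To illustrate what such a check has to do, here is a rough sketch in
Python rather than m4.  It is not part of the attached file; the
candidate list and the name `find_utf8_locale` are made up for this
example.  It tries each locale name in turn and, where `nl_langinfo`
is available, additionally checks that the reported codeset is UTF-8.
It does not prove that UTF-8 handling actually works, which is the
hard part.

```python
import locale

# Hypothetical candidate names, tried in this order.
CANDIDATES = ("C.UTF-8", "C.utf8", "en_US.UTF-8", "en.UTF-8")


def find_utf8_locale(candidates=CANDIDATES):
    """Return the first candidate the C library accepts, or None."""
    # Calling setlocale() without a locale argument only queries it.
    saved = locale.setlocale(locale.LC_CTYPE)
    try:
        for name in candidates:
            try:
                locale.setlocale(locale.LC_CTYPE, name)
            except locale.Error:
                continue  # locale name not known to the system
            # nl_langinfo() is missing on some platforms (e.g. native
            # Windows); skip the codeset check there.
            if hasattr(locale, "nl_langinfo"):
                codeset = locale.nl_langinfo(locale.CODESET)
                if codeset.upper().replace("-", "") != "UTF8":
                    continue
            return name
        return None
    finally:
        # Leave the process locale as we found it.
        locale.setlocale(locale.LC_CTYPE, saved)
```

Calling `find_utf8_locale()` gives one of the candidate names or
`None`; which one depends entirely on the locales installed on the
machine.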

> Maybe we ought to set only LC_MESSAGES to C? It should prevent
> overriding the system locale for file names but still make logs
> output in English.

Honestly, I don't know.  The whole locale stuff is such a mess.
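
For what it's worth, the idea could be sketched like this for a
subprocess environment.  This is untested against the LilyPond build
and only an assumption about how it might look; the helper name
`english_messages_env` is invented here.  Since `LC_ALL` overrides
every `LC_*` variable, the sketch moves its value to `LANG` so that
the encoding and the other categories stay as they were, and only
then forces `LC_MESSAGES=C`.

```python
def english_messages_env(base_env):
    """Copy base_env so that only messages fall back to the C locale."""
    env = dict(base_env)
    # An empty LC_ALL counts as unset.
    lc_all = env.pop("LC_ALL", "")
    if lc_all:
        # LC_ALL used to override all LC_* variables; drop them and
        # keep the same effective locale via LANG instead.
        for key in [k for k in env if k.startswith("LC_")]:
            del env[key]
        env["LANG"] = lc_all
    # Untranslated (English) messages only.
    env["LC_MESSAGES"] = "C"
    # Drop LANGUAGE to be on the safe side, so it cannot select
    # translations.
    env.pop("LANGUAGE", None)
    return env
```

It would be used as `subprocess.run(cmd, env=english_messages_env(os.environ))`,
with `cmd` being whatever program gets called.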


     Werner

From 231f5c2fba0fccf333fc7873da771fc8493b71e2 Mon Sep 17 00:00:00 2001
From: Werner Lemberg <w...@gnu.org>
Date: Mon, 7 Nov 2022 09:38:17 +0100
Subject: [PATCH 1/2] generic-vars.make: Fix locale for build process
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* In non-English indices of the PDF documentation, this fixes the initials
  for entries starting with a non-ASCII character.  This is a generic
  problem with the `texindex` awk script (used by the Texinfo infrastructure
  to sort indices).

    https://lists.gnu.org/archive/html/bug-texinfo/2022-11/msg00008.html

* Some non-ASCII characters are now correctly displayed in info, PDF, and
  HTML output files (example: `rotated 45?` → `rotated 45°` in the
  documentation of the `\rotate` markup).

Also update documentation accordingly.
---
 Documentation/en/contributor/lsr-work.itexi | 15 ++++++++-----
 Documentation/en/usage/running.itely        |  2 +-
 Documentation/es/usage/running.itely        |  3 ++-
 Documentation/fr/usage/running.itely        |  2 +-
 Documentation/it/usage/running.itely        |  2 +-
 Documentation/ja/usage/running.itely        |  2 +-
 make/generic-vars.make                      | 25 ++++++++++++++++++---
 python/book_latex.py                        |  2 +-
 python/book_texinfo.py                      |  2 +-
 9 files changed, 39 insertions(+), 16 deletions(-)

diff --git a/Documentation/en/contributor/lsr-work.itexi b/Documentation/en/contributor/lsr-work.itexi
index b03ebf8122..c69d8ca485 100644
--- a/Documentation/en/contributor/lsr-work.itexi
+++ b/Documentation/en/contributor/lsr-work.itexi
@@ -445,12 +445,15 @@ updating the binary running the LSR.
 
 @item
 Download the latest snippet tarball from
-@uref{https://lsr.di.unimi.it/download/} and extract it.
-The relevant files can be found in the @file{all} subdirectory.
-Make sure your shell is using an English language version, for
-example @code{LANG=en_US}, then run @command{convert-ly} on all
-the files.  Use the command-line option @code{--to=version} to
-ensure the snippets are updated to the correct stable version.
+@uref{https://lsr.di.unimi.it/download/} and extract it; the
+relevant files can be found in the @file{all} subdirectory.
+
+Now run @code{LANG=C.UTF-8 convert-ly} on all the files.  The
+@code{LANG=C.UTF-8} part sets the @code{LANG} environment variable
+to a locale with UTF-8 encoding and disabled translations while
+@command{convert-ly} gets executed.  Also use the command-line
+option @code{--to=@var{version}} to ensure the snippets are
+updated to the correct stable version.
 
 @item
 Make sure that you are using @command{convert-ly} from the latest
diff --git a/Documentation/en/usage/running.itely b/Documentation/en/usage/running.itely
index a24c7ac392..029d1ae446 100644
--- a/Documentation/en/usage/running.itely
+++ b/Documentation/en/usage/running.itely
@@ -925,7 +925,7 @@ overrides the value derived from the location of the
 @item LANG
 The language for LilyPond data sent to @code{stdout} and
 @code{stderr}, for example progress reports, warning messages, or
-debug output.  Example: @code{LANG=de}.
+debug output.  Example: @samp{LANG=de_DE.UTF-8}.
 
 @item LILYPOND_LOGLEVEL
 The default loglevel.  If LilyPond is called without an explicit
diff --git a/Documentation/es/usage/running.itely b/Documentation/es/usage/running.itely
index 7f2629cd7d..7c722c50af 100644
--- a/Documentation/es/usage/running.itely
+++ b/Documentation/es/usage/running.itely
@@ -977,7 +977,8 @@ ubicación del binario @command{lilypond}.
 @item LANG
 Idioma de los datos de LilyPond enviados a @code{stdout} y a
 @code{stderr}, por ejemplo informes del avance, mensajes de
-advertencia o salida de depuración.  Ejemplo: @code{LANG=es}.
+advertencia o salida de depuración.  Ejemplo:
+@samp{LANG=es_ES.UTF-8}.
 
 @item LILYPOND_LOGLEVEL
 Nivel de registro predeterminado.  Si LilyPond se llama sin ningún
diff --git a/Documentation/fr/usage/running.itely b/Documentation/fr/usage/running.itely
index 5295ab0638..752ace6a2a 100644
--- a/Documentation/fr/usage/running.itely
+++ b/Documentation/fr/usage/running.itely
@@ -982,7 +982,7 @@ partir d'où réside le binaire @command{lilypond}.
 Cette variable détermine la langue dans laquelle sont émises les
 données sur @code{stdout} (sortie standard) et @code{stderr} (sortie des
 erreurs), pour afficher la progession, les avertissements ou messages de
-débogage. Par exemple : @code{LANG=de}.
+débogage. Par exemple : @samp{LANG=fr_FR.UTF-8}.
 
 @item LILYPOND_LOGLEVEL
 Cette variable détermine le niveau par défaut de verbosité.  En
diff --git a/Documentation/it/usage/running.itely b/Documentation/it/usage/running.itely
index 6e8429660c..4b30279047 100644
--- a/Documentation/it/usage/running.itely
+++ b/Documentation/it/usage/running.itely
@@ -974,7 +974,7 @@ binario @command{lilypond}.
 @item LANG
 La lingua per i dati LilyPond inviati a @code{stdout} e @code{stderr}, per
 esempio relazioni sullo stato di avanzamento, messaggi di avviso o informazioni
-di debug.  Esempio: @code{LANG=de}.
+di debug.  Esempio: @samp{LANG=it_IT.UTF-8}.
 
 @item LILYPOND_LOGLEVEL
 Il livello di log (loglevel) predefinito. Se LilyPond viene chiamato senza un
diff --git a/Documentation/ja/usage/running.itely b/Documentation/ja/usage/running.itely
index 87201b2141..5c1d0499a1 100644
--- a/Documentation/ja/usage/running.itely
+++ b/Documentation/ja/usage/running.itely
@@ -958,7 +958,7 @@ SVG ビュアーが対応していないことがあるので、@c
 @item LANG
 @code{stdout} および @code{stderr} に送信される LilyPond データ、@c
 たとえば、進捗レポート、警告メッセージ、デバッグ出力などの言語を@c
-選択します。例: @code{LANG=de}
+選択します。例: @code{LANG=ja_JP.UTF-8}
 
 @item LILYPOND_LOGLEVEL
 デフォルトのログレベル。@c
diff --git a/make/generic-vars.make b/make/generic-vars.make
index 481e29c420..d62d9ddb28 100644
--- a/make/generic-vars.make
+++ b/make/generic-vars.make
@@ -60,10 +60,29 @@ TOPLEVEL_VERSION=$(TOPLEVEL_MAJOR_VERSION).$(TOPLEVEL_MINOR_VERSION).$(TOPLEVEL_
 endif
 
 
-# no locale settings in the build process.
-LANG=C
+# The locale settings in the build process.
+
+# Note that `LANG=C` doesn't work correctly: without a UTF-8 locale,
+# some programs like `texindex` (used by Texinfo to sort indices) emit
+# single-byte output under certain conditions, i.e., the created files
+# can contain bytes that are not valid UTF-8.
+LANG=C.UTF-8
 export LANG
 
+# Contrary to `LANG=C`, the `LANGUAGE` environment variable is *not*
+# ignored for `LANG=C.UTF-8`, at least not on all platforms that use
+# the 'glibc' library: as of November 2022, only the forthcoming
+# 'glibc' version 2.36 will correctly ignore `LANGUAGE` for the
+# 'C.UTF-8' locale.
+#
+#   https://sourceware.org/bugzilla/show_bug.cgi?id=29777
+#
+# We have thus to enforce the default language, otherwise
+# `ly:generate-warning` in regression fails because messages get
+# translated.
+LANGUAGE=
+export LANGUAGE
+
 
 # texi2html iterates over section headers stored as entries of a map.
 # Disable Perl's hash randomization to make the order reproducible.
@@ -108,7 +127,7 @@ script-dir = $(src-depth)/scripts
 export PYTHONPATH:=$(auxpython-dir):$(PYTHONPATH)
 
 MAKEINFO_FLAGS += --enable-encoding --error-limit=0
-MAKEINFO = LANG=C $(MAKEINFO_PROGRAM) $(MAKEINFO_FLAGS)
+MAKEINFO = LANG=C.UTF-8 LANGUAGE= $(MAKEINFO_PROGRAM) $(MAKEINFO_FLAGS)
 
 # texi2html v5 has fatal errors in the build, so only be strict about
 # errors in the version we officially support
diff --git a/python/book_latex.py b/python/book_latex.py
index ab8b2c7d3e..806f58cf6b 100644
--- a/python/book_latex.py
+++ b/python/book_latex.py
@@ -214,7 +214,7 @@ def get_latex_textwidth(source, global_options):
     cmd = '%s %s' % (global_options.latex_program, tmpfile)
     ly.debug_output("Executing: %s\n" % cmd)
     run_env = os.environ.copy()
-    run_env['LC_ALL'] = 'C'
+    run_env['LC_ALL'] = 'C.UTF-8'
     run_env['TEXINPUTS'] = os.path.pathsep.join(
                              (global_options.input_dir,
                               run_env.get('TEXINPUTS', '')))
diff --git a/python/book_texinfo.py b/python/book_texinfo.py
index d1e429f6cc..c44f23560e 100644
--- a/python/book_texinfo.py
+++ b/python/book_texinfo.py
@@ -232,7 +232,7 @@ def get_texinfo_width_indent(source, global_options):
         global_options.texinfo_program, outfile, tmpfile)
     ly.debug_output("Executing: %s\n" % cmd)
     run_env = os.environ.copy()
-    run_env['LC_ALL'] = 'C'
+    run_env['LC_ALL'] = 'C.UTF-8'
 
     # unknown why this is necessary
     universal_newlines = True
-- 
2.38.1

dnl locale-c-utf8.m4   -*-shell-script-*-

dnl Copyright (C) 2003, 2005-2018, 2022 Free Software Foundation, Inc.
dnl
dnl This file is free software; the Free Software Foundation
dnl gives unlimited permission to copy and/or distribute it,
dnl with or without modifications, as long as this notice is preserved.

dnl From Bruno Haible and Werner Lemberg.

dnl Find a 'C.UTF-8' locale encoding.
dnl This file is based on `locale-de.m4` from 'gnulib'.
AC_DEFUN([LOCALE_C_UTF8],
[
  AC_REQUIRE([AM_LANGINFO_CODESET])
  AC_CACHE_CHECK([for a 'C.UTF-8' locale], [ac_cv_locale_c_utf8], [
    AC_LANG_CONFTEST([AC_LANG_SOURCE([[
#include <locale.h>
#include <time.h>
#if HAVE_LANGINFO_CODESET
# include <langinfo.h>
#endif
#include <stdlib.h>
#include <string.h>
struct tm t;
char buf[16];
int main () {
  /* On BeOS and Haiku, locales are not implemented in libc.  Rather, libintl
     imitates locale dependent behaviour by looking at the environment
     variables, and all locales use the UTF-8 encoding.  */
#if !(defined __BEOS__ || defined __HAIKU__)
  /* Check whether the given locale name is recognized by the system.  */
# if defined _WIN32 && !defined __CYGWIN__
  /* On native Windows, setlocale(category, "") looks at the system settings,
     not at the environment variables.  Also, when an encoding suffix such
     as ".65001" or ".54936" is specified, it succeeds but sets the LC_CTYPE
     category of the locale to "C".  */
  if (setlocale (LC_ALL, getenv ("LC_ALL")) == NULL
      || strcmp (setlocale (LC_CTYPE, NULL), "C") == 0)
    return 1;
# else
  if (setlocale (LC_ALL, "") == NULL) return 1;
# endif


  /* Check whether nl_langinfo(CODESET) is nonempty and not "ASCII" or "646".
     On Mac OS X 10.3.5 (Darwin 7.5) in the de_DE locale, nl_langinfo(CODESET)
     is empty, and the behaviour of Tcl 8.4 in this locale is not useful.
     On OpenBSD 4.0, when an unsupported locale is specified, setlocale()
     succeeds but then nl_langinfo(CODESET) is "646". In this situation,
     some unit tests fail.  */
# if 0 && HAVE_LANGINFO_CODESET
  /* XXX: What should this look like for 'C.utf8' or 'en_US.UTF-8'? */
  {
    const char *cs = nl_langinfo (CODESET);
    if (cs[0] == '\0' || strcmp (cs, "ASCII") == 0 || strcmp (cs, "646") == 0)
      return 1;
  }
# endif


# ifdef __CYGWIN__
  /* On Cygwin, avoid locale names without encoding suffix, because the
     locale_charset() function relies on the encoding suffix.  Note that
     LC_ALL is set on the command line.  */
  if (strchr (getenv ("LC_ALL"), '.') == NULL) return 1;
# endif


  /* XXX How can I test that UTF-8 encoding actually works? */


#endif
  return 0;
}
      ]])])
    if AC_TRY_EVAL([ac_link]) && test -s conftest$ac_exeext; then
      case "$host_os" in
        # Handle native Windows specially, because there setlocale() interprets
        # "ar" as "Arabic" or "Arabic_Saudi Arabia.1256",
        # "fr" or "fra" as "French" or "French_France.1252",
        # "ge"(!) or "deu"(!) as "German" or "German_Germany.1252",
        # "ja" as "Japanese" or "Japanese_Japan.932",
        # and similar.
        mingw*)
          if (LC_ALL=.65001 \
              LC_TIME= \
              LC_CTYPE= \
              ./conftest; exit) 2>/dev/null; then
            ac_cv_locale_c_utf8=.65001
          # Test for the hypothetical native Windows locale name.
          # XXX Shouldn't this rather be 'English_US.65001'?
          elif (LC_ALL="English_United States.65001" \
                LC_TIME= \
                LC_CTYPE= \
                ./conftest; exit) 2>/dev/null; then
            ac_cv_locale_c_utf8="English_United States.65001"
          else
            # None found.
            ac_cv_locale_c_utf8=none
          fi
          ;;
        *)
          if (LC_ALL=C \
              ./conftest; exit) 2>/dev/null; then
            ac_cv_locale_c_utf8=C
          # Setting LC_ALL is not enough. Need to set LC_TIME to empty, because
          # otherwise on Mac OS X 10.3.5 the LC_TIME=C from the beginning of the
          # configure script would override the LC_ALL setting. Likewise for
          # LC_CTYPE, which is also set at the beginning of the configure
          # script.
          # Test for the usual locale name.
          elif (LC_ALL=en_US \
                LC_TIME= \
                LC_CTYPE= \
                ./conftest; exit) 2>/dev/null; then
            ac_cv_locale_c_utf8=en_US
          else
            # Test for the locale name with explicit encoding suffix.
            if (LC_ALL=C.UTF-8 \
                LC_TIME= \
                LC_CTYPE= \
                ./conftest; exit) 2>/dev/null; then
              ac_cv_locale_c_utf8=C.UTF-8
            elif (LC_ALL=en_US.UTF-8 \
                  LC_TIME= \
                  LC_CTYPE= \
                  ./conftest; exit) 2>/dev/null; then
              ac_cv_locale_c_utf8=en_US.UTF-8
            else
              # Test for the Solaris 7 locale name.
              if (LC_ALL=en.UTF-8 \
                  LC_TIME= \
                  LC_CTYPE= \
                  ./conftest; exit) 2>/dev/null; then
                ac_cv_locale_c_utf8=en.UTF-8
              else
                # None found.
                ac_cv_locale_c_utf8=none
              fi
            fi
          fi
          ;;
      esac
    fi
    rm -fr conftest*
  ])
  LOCALE_C_UTF8=$ac_cv_locale_c_utf8
  AC_SUBST([LOCALE_C_UTF8])
])
