On Thu, 06 Jun 2024 at 13:32:27 +0300, Hakan Bayındır wrote:
> C, or C.UTF-8 is not a universal locale which works
> for all.

Sure, and I don't think anyone is arguing that you or anyone else should
set the locale for your interactive terminal session, your GUI desktop
environment, or even your servers to C.UTF-8.

But this thread is about build environments for our packages, not about
runtime environments. We have two-and-a-half possible policies:

1. Status quo, in theory:

   Packages cannot make any assumptions about build-time locales.

   The benefits are:

   - Diagnostic messages are in the maintainer's local language, and
     potentially easier to understand.

   - If a mass-QA effort wants to assess whether the program is broken by
     a particular locale, they can easily try running its build-time tests
     in that locale, **if** the tests do not already force a different
     locale. (But this comes with some serious limitations: it's likely
     to have a significant number of false-positive situations where the
     program is actually working perfectly but the **tests** make assumptions
     that are not true in all locales, and as a result many upstream
     projects set their build-time tests to force specific locales
     anyway - often C, en_US.UTF-8 or C.UTF-8 - regardless of what we
     might prefer in Debian.)

   The costs are:

   - Every program that might be run at build-time is expected to continue
     to cope with running in non-UTF-8 locales, even if we strongly deprecate
     non-UTF-8 locales for production use.

   - Diagnostic messages from the reproducible-builds infrastructure are
     in a random language chosen by the infrastructure, which the maintainer
     does not necessarily understand. (If my package fails to build in a
     Chinese locale, that's a valid bug, but if I'm expected to diagnose the
     problem by reading Chinese error messages, as a non-Chinese-speaker I
     am not going to get far.)

   - If a program that is run during build intentionally has locale-specific
     output, and its output ends up in the .deb, then the package maintainer
     must go to additional effort to force that particular program to have
     reproducible output, usually by running it in a specified locale.

2. What's being proposed in this thread:

   Each package can assume that it's built in the C.UTF-8 locale.
   If it needs a different locale during testing, it can set that itself
   (as e.g. glib2.0 does for some tests), but unless it takes explicit
   action, C.UTF-8 will be used.

   The benefit is that packages that require a UTF-8 locale during build
   or during testing (e.g. to process non-English strings in Python)
   can assume that they have one, and an entire equivalence class of bugs
   (packages where the content of the .deb can vary with the build-time
   locale, or where e.g. build-time tests fail if UTF-8 output is not
   possible) becomes a class of non-bugs that we do not need to think about.

   The cost is that we no longer get the benefits from (1.).

2½. Unwelcome compromise (increasingly the status quo):

   Whenever a package is non-reproducible, fails to build or fails tests
   in certain locales (for example legacy non-UTF-8 locales like C or
   en_GB.ISO-8859-15), we add `export LC_ALL=C.UTF-8` to debian/rules and
   move on.

   This is just (2.) with extra steps. For the affected packages it has
   the same benefit and cost as (2.), plus an additional cost: someone must
   identify that the package is in this category and copy/paste the extra
   line. For unmodified packages it has the same benefits and costs as (1.).
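
For concreteness, here is roughly what that one-line override does for a
program run during the build, sketched in Python rather than make. The
helper name run_in_build_locale is made up for this example, and sort(1)
merely stands in for any tool whose output depends on the locale; it is
assumed to be on PATH.

```python
import os
import subprocess


def run_in_build_locale(argv, input_text, build_locale="C.UTF-8"):
    """Run argv with LC_ALL forced and return its stdout as text.

    Hypothetical helper: it mimics `export LC_ALL=C.UTF-8` for one child.
    """
    env = dict(os.environ)
    # LC_ALL takes precedence over LANG and every other LC_* variable.
    env["LC_ALL"] = build_locale
    result = subprocess.run(
        argv,
        input=input_text.encode("utf-8"),
        stdout=subprocess.PIPE,
        env=env,
        check=True,
    )
    return result.stdout.decode("utf-8")


# sort(1) collates according to the locale, so without pinning LC_ALL its
# output can differ between builders that use different locales.
sorted_words = run_in_build_locale(["sort"], "b\nA\na\nB\n")
print(sorted_words, end="")
```

With the locale pinned, the order no longer depends on whatever LANG the
builder happened to have, which is the whole point of the override.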

2½ seems like the same boil-the-ocean pattern as any number of
manual-work-intensive transitions: Rules-Requires-Root, debhelper compat
levels, compiler hardening flags and so on. In situations where the
desired state is a backwards-compatibility break, the benefit of having
the transition be opt-in can exceed its (considerable!) cost, but we
shouldn't let that trick us into always paying the additional cost of an
opt-in transition, even in situations where it isn't worth it.

> [Turkish dotted/dotless i]
> creates tons of problems with software which are not aware of the
> issue  (Kodi completely breaks for example, and some software needs
> forced/custom environments to run).

I agree that internationalization issues can be a serious problem **at
runtime**, and when our developers and users find such problems, they can
be reported as bugs downstream or upstream, and (hopefully!) fixed. What
I do not agree with is your suggestion that having the package build
occur in an undefined locale will solve this problem.

For example, let's imagine that we decide that perfect support for Turkish
is a release goal. Having reproducible-builds.org build packages in an
arbitrary language (in practice French is often used, I think?) doesn't
prove anything about whether they handle Turkish correctly, whatever
"correctly" might mean.

If someone wants to do a QA mass-rebuild in the tr_TR.UTF-8 locale,
that would give us somewhat higher confidence in our
ability to run software in Turkish - but is it working *correctly*, or
are the tests making the wrong assertions, or are the code paths that
could go wrong in Turkish not even being tested? We probably won't know
any of those until a Turkish speaker investigates that specific piece
of software.

The fact that you say "Kodi completely breaks" also suggests to me that
fixing these problems is not trivial, because if it were easy, it would
have been fixed by now. And yet we ship Kodi in Debian, even knowing
that it has this bug, and it seems to work OK for most people.

Even if Kodi's problems with Turkish text are solved, **and** the
developer who solves those problems adds a build-time regression test
to avoid the bug coming back, I would expect the test to need to look
like this pseudocode:

    def test_turkish():
        old_locale = setlocale(LC_ALL, null)    # query only, changes nothing

        if setlocale(LC_ALL, "tr_TR.UTF-8") is null:
            skipTest("tr_TR.UTF-8 locale not available, try installing locales-all")

        try:
            do some stuff involving Turkish text
            assert that the right thing happens
        finally:
            setlocale(LC_ALL, old_locale)

... for which having the rest of the build happen in the tr_TR.UTF-8
locale isn't even useful!

(src:glib2.0 has several tests like this, and the packaging goes to some
lengths to make sure that the required locales are available.)
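
As a rough illustration, the pseudocode above could be made runnable in
Python along these lines. This is only a sketch: the class name and the
collation check are hypothetical stand-ins for "do some stuff involving
Turkish text", and the expected order assumes glibc's tr_TR collation
rules.

```python
import locale
import unittest


class TurkishCollationTest(unittest.TestCase):
    """Sets the locale it needs itself, whatever the build locale is."""

    LOCALE = "tr_TR.UTF-8"

    def test_turkish(self):
        # With no second argument, setlocale() only queries the current value.
        old_locale = locale.setlocale(locale.LC_ALL)
        # Registered before switching; restoring is harmless if we skip.
        self.addCleanup(locale.setlocale, locale.LC_ALL, old_locale)

        try:
            locale.setlocale(locale.LC_ALL, self.LOCALE)
        except locale.Error:
            self.skipTest(
                f"{self.LOCALE} locale not available, try installing locales-all"
            )

        # Hypothetical check (assumes glibc's tr_TR rules): c-cedilla
        # collates between "c" and "d", not after "z" as in byte order.
        words = ["d", "z", "ç", "c"]
        self.assertEqual(
            sorted(words, key=locale.strxfrm), ["c", "ç", "d", "z"]
        )
```

Run under `python3 -m unittest`, this is reported as skipped rather than
failed on a system where the locale has not been generated, and it behaves
the same whether the surrounding build runs in C.UTF-8 or anything else.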

A wider point here is that artificially elevating a certain class of bugs
to be de-facto release-critical by turning them into build failures is
not necessarily always going to improve the quality of Debian: we have
no shortage of bugs to work on, and a finite amount of volunteer time
available. Any time we make a class of bugs release-critical like this,
that's taking volunteer time away from identifying and fixing different
bugs that might have a larger impact on the overall quality of the
distribution, so we should only do this if we are sure that that class
of bugs is genuinely among our highest priorities.

Stepping back from the specifics of locales, I observe that operating
systems are extremely complicated and contain an overwhelming number
of choices and code paths. Obviously most of those choices are there
because someone needs them - but some are only there for historical
reasons or as an unintended side-effect of something more beneficial. If
we can make a simplifying assumption that will take an entire equivalence
class of bugs and make them into non-bugs, without losing significant
functionality or flexibility, then it's often good to do that instead.

(For example, a while ago we replaced "it is undefined whether /usr is
mounted or not during early boot" with the simplifying assumption "if
/usr is separate then it must be mounted by the initramfs", which turned a
whole class of bugs of the form "x is in /lib but depends on y which is in
/usr/lib" into non-bugs that do not need to be fixed or even identified.)

    smcv
