On Thu, 06 Jun 2024 at 13:32:27 +0300, Hakan Bayındır wrote:
> C, or C.UTF-8 is not a universal locale which works
> for all.

Sure, and I don't think anyone is arguing that you or anyone else should
set the locale for your interactive terminal session, your GUI desktop
environment, or even your servers to C.UTF-8.

But, this thread is about build environments for our packages, not about
runtime environments. We have two-and-a-half possible policies:

1. Status quo, in theory: Packages cannot make any assumptions about
   build-time locales.

   The benefits are:

   - Diagnostic messages are in the maintainer's local language, and
     potentially easier to understand.
   - If a mass-QA effort wants to assess whether the program is broken
     by a particular locale, they can easily try running its build-time
     tests in that locale, **if** the tests do not already force a
     different locale. (But this comes with some serious limitations:
     it's likely to have a significant number of false-positive
     situations where the program is actually working perfectly but the
     **tests** make assumptions that are not true in all locales, and as
     a result many upstream projects set their build-time tests to force
     specific locales anyway - often C, en_US.UTF-8 or C.UTF-8 -
     regardless of what we might prefer in Debian.)

   The costs are:

   - Every program that might be run at build-time is expected to
     continue to cope with running in non-UTF-8 locales, even if we
     strongly deprecate non-UTF-8 locales for production use.
   - Diagnostic messages from the reproducible-builds infrastructure
     are in a random language chosen by the infrastructure, which the
     maintainer does not necessarily understand. (If my package fails
     to build in a Chinese locale, that's a valid bug, but if I'm
     expected to diagnose the problem by reading Chinese error messages,
     as a non-Chinese-speaker I am not going to get far.)
   - If a program that is run during build intentionally has
     locale-specific output, and its output ends up in the .deb, then
     the package maintainer must go to additional effort to force that
     particular program to have reproducible output, usually by running
     it in a specified locale.

2. What's being proposed in this thread: Each package can assume that
   it's built in the C.UTF-8 locale. If it needs a different locale
   during testing, it can set that itself (as e.g. glib2.0 does for some
   tests), but unless it takes explicit action, C.UTF-8 will be used.

   The benefit is that packages that require a UTF-8 locale during build
   or during testing (e.g. to process non-English strings in Python) can
   assume that they have one, and an equivalence class of bugs (packages
   where the content of the .deb can vary with the build-time locale, or
   where e.g. build-time tests fail if UTF-8 output is not possible)
   becomes a class of non-bugs that we do not need to think about.

   The cost is that we don't get the benefits from (1.) any more.

2½. Unwelcome compromise (increasingly the status quo): Whenever a
   package is non-reproducible, fails to build or fails tests in certain
   locales (for example legacy non-UTF-8 locales like C or
   en_GB.ISO-8859-15), we add `export LC_ALL=C.UTF-8` to debian/rules
   and move on.

   For the affected packages, this is just (2.) with extra steps: it has
   the same benefit and cost as (2.), plus an additional cost (someone
   must identify that the package is in this category and copy/paste the
   extra line). For unmodified packages, it has the same benefits and
   costs as (1.).

2½ seems like the same boil-the-ocean pattern as any number of
manual-work-intensive transitions: Rules-Requires-Root, debhelper compat
levels, compiler hardening flags and so on. In situations where the
desired state is a backwards-compatibility break, the benefit of having
the transition be opt-in can exceed its (considerable!)
cost, but we shouldn't let that trick us into always paying the
additional cost of an opt-in transition, even in situations where it
isn't worth it.

> [Turkish dotted/dotless i]
> creates tons of problems with software which are not aware of the
> issue (Kodi completely breaks for example, and some software needs
> forced/custom environments to run).

I agree that internationalization issues can be a serious problem **at
runtime**, and when our developers and users find such problems, they
can be reported as bugs downstream or upstream, and (hopefully!) fixed.

What I do not agree with is your suggestion that having the package
build occur in an undefined locale will solve this problem.

For example, let's imagine that we decide that perfect support for
Turkish is a release goal. Having reproducible-builds.org build packages
in an arbitrary language (in practice French is often used, I think?)
doesn't prove anything about whether they handle Turkish correctly,
whatever "correctly" might mean.

If someone wants to do a QA mass-rebuild in the tr_TR.UTF-8 locale, that
would come a little closer to having higher confidence about our ability
to run software in Turkish - but is it working *correctly*, or are the
tests making the wrong assertions, or are the code paths that could go
wrong in Turkish not even being tested? We probably won't know any of
those until a Turkish speaker investigates that specific piece of
software.

The fact that you say "Kodi completely breaks" also suggests to me that
fixing these problems is not trivial, because if it was easy, it would
have been fixed by now. And yet we ship Kodi in Debian, even knowing
that it has this bug, and it seems to work OK for most people.
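
To make the dotted/dotless i issue quoted above concrete, here is a minimal Python sketch of the general class of bug. It is only an illustration under stated assumptions: it is not taken from Kodi or any other package, and the helpers `turkish_lower` and `turkish_upper` are hypothetical names invented for this example. Python's `str.lower()` and `str.upper()` use the locale-independent Unicode default case mappings, so the results are the same whichever locale the build or test run happens to use.

```python
# Sketch of the Turkish dotted/dotless i casing pitfall.
# Python's str casing methods use Unicode's default (locale-independent)
# mappings, so the process locale does not change any result below.


def turkish_lower(text: str) -> str:
    """Hypothetical helper: lowercase using Turkish rules for I and İ."""
    # Turkish pairs: I <-> ı (dotless) and İ <-> i (dotted).
    return text.replace("İ", "i").replace("I", "ı").lower()


def turkish_upper(text: str) -> str:
    """Hypothetical helper: uppercase using Turkish rules for i and ı."""
    return text.replace("i", "İ").replace("ı", "I").upper()


default_lower = "TITLE".lower()       # "title": what most code assumes
turkish_result = turkish_lower("TITLE")  # "tıtle": what Turkish casing gives

# The default mapping of İ (U+0130) is "i" plus a combining dot above,
# so the lowercased string is two code points long, not one.
dotted_capital_lowered = "İ".lower()

# A round trip through the default mappings loses the dotless ı.
round_trip = "ı".upper().lower()      # "i", not "ı"

print(default_lower, turkish_result, len(dotted_capital_lowered), round_trip)
```

Code that compares case-folded identifiers, file names or configuration keys can go wrong in either direction depending on which of these mappings it ends up using, which is why knowing whether a test makes the right assertions needs someone who understands the expected Turkish behaviour.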

Even if Kodi's problems with Turkish text are solved, **and** the
developer who solves those problems adds a build-time regression test to
avoid the bug coming back, I would expect the test to need to look like
this pseudocode:

    def test_turkish:
        old_locale = setlocale(LC_ALL, "tr_TR.UTF-8")

        if old_locale is null:
            skipTest("tr_TR.UTF-8 locale not available, try installing locales-all")

        try:
            do some stuff involving Turkish text
            assert that the right thing happens
        finally:
            setlocale(LC_ALL, old_locale)

... for which having the rest of the build happen in the tr_TR.UTF-8
locale isn't even useful! (src:glib2.0 has several tests like this, and
the packaging goes to some lengths to make sure that the required
locales are available.)

A wider point here is that artificially elevating a certain class of
bugs to be de-facto release-critical by turning them into build failures
is not necessarily always going to improve the quality of Debian: we
have no shortage of bugs to work on, and a finite amount of volunteer
time available. Any time we make a class of bugs release-critical like
this, that's taking volunteer time away from identifying and fixing
different bugs that might have a larger impact on the overall quality of
the distribution, so we should only do this if we are sure that that
class of bugs is genuinely among our highest priorities.

Stepping back from the specifics of locales, I observe that operating
systems are extremely complicated and contain an overwhelming number of
choices and code paths. Obviously most of those choices are there
because someone needs them - but some are only there for historical
reasons or as an unintended side-effect of something more beneficial. If
we can make a simplifying assumption that will take an entire
equivalence class of bugs and make them into non-bugs, without losing
significant functionality or flexibility, then it's often good to do
that instead.

(For example, a while ago we replaced "it is undefined whether /usr is
mounted or not during early boot" with the simplifying assumption "if
/usr is separate then it must be mounted by the initramfs", which turned
a whole class of bugs of the form "x is in /lib but depends on y which
is in /usr/lib" into non-bugs that do not need to be fixed or even
identified.)

    smcv