Hi! On Sat, 2024-07-06 at 13:13:48 +0200, Alexandre Detiste wrote: > Le mar. 2 juil. 2024 à 14:37, Guillem Jover <guil...@debian.org> a écrit : > > On Tue, 2024-07-02 at 09:52:05 +0100, Simon McVittie wrote: > > > On Tue, 02 Jul 2024 at 03:47:29 +0200, Guillem Jover wrote: > > > > On Fri, 2024-06-07 at 15:40:07 +0200, Alexandre Detiste wrote: > > > > > Maybe a compromise would be to at least mandate some UTF-8 locale. > > > > > > > > dpkg-buildpackage: Require an UTF-8 (or ASCII) locale when > > > > building packages > > > > > > Allowing ASCII seems counterproductive: that puts us in the code path > > > where various tools and runtimes (especially Python) will refuse to > > > process or output anything outside the 0-127 range, which I believe is > > > exactly the problem that debhelper aims to solve by using C.UTF-8 for > > > some categories of package (in particular those that build with Meson). > > > > > > To get what Alexandre suggested, we'd need to allow UTF-8 but not allow > > > ASCII (so for example fr_FR.UTF-8 or C.UTF-8 is fine, but in particular > > > the C locale is not). > > > > But, I guess I can at least unconditionally set LC_CTYPE=C.UTF-8 when > > using --sanitive-env, right away though.
I did something like that as part of dpkg 1.22.7, with commit: https://git.dpkg.org/cgit/dpkg/dpkg.git/commit/?id=df60765ed4bc6640b788c796dd0c627d7714f807 Which should guarantee a UTF-8 codeset and stable sorting, while preserving any translated output messages (and other locale settings). > One thing that could be fixed quite quickly is fixing the few > remaining official buildd workers that does not yet run with an UTF-8 locale. Something I realized after adding the above change, is that sbuild has been running dpkg-buildpackage with --sanitize-env for a while now, which checking now I was told about at the time, but I either didn't piece together its consequences or perhaps forgot that the sbuild package is nowadays used in build daemons (instead of the old fork) and then forgot. :) (BTW, not blaming josch! I think that change in sbuild on its own makes sense, I guess I was just not expecting the option to be used that way, and perhaps its documentation should have somehow made it more clear. :) I guess this is both good and "bad". It's good because now all build daemons will have a guaranteed UTF-8 locale codeset already starting with Debian trixie, as requested in this thread, and give us a more uniform build environment. It's "bad" because part of the reason to add this through a new --sanitize-env option was to make this behavior and its guarantees opt-in, but if the official Debian builds are using this, then it's in a way equivalent to having set this by default w/o the option, but perhaps worse because people running local build will not have the same environment (although it's going to be easy to replicate by passing that option, but a bit harder when calling debian/rules directly which we still support). I'm not sure the current state is ideal, because we are back to packages being able to rely on some stuff on build daemons, that are not guaranteed by default for our supported build entry points, and if the result of this is that we end up patching all dpkg-buildpackage callers to pass --sanitize-env, then we could have as well simply changed the default instead. I think a way forward could be to make the sanitizing the default, and finally drop debian/rules as a supported (user) build entry point, I had in mind re-proposing this already, but the above kind of gives it more urgency, so I'll try to do that soon. This also means, I guess, that part of the previous freedom I thought we had to modify the --sanitize-env behavior is kind of gone now (and would be gone too if we move its behavior as the default one), and we should apply similar care as if the default itself was being changed, because it has the potential to break the archive (via build daemons). I'm thinking that depending on the changes there, it might be better to gate them via dpkg-build-api(7) levels. I should also document the vendor specific behavior in some manual page, as it is currently listed as unspecified "vendor specific". > If one is unlucky the build will mysteriously fail. > > Adding export {LC_ALL|LANG|LC_CTYPE}=C.UTF-8 > in every single d/rules by fear of this seems overkill. > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1074586 > https://buildd.debian.org/status/package.php?p=xrayutilities I implemented the attached patch (also on the next/req-utf8 branch) to force a locale with a UTF-8 codeset, which would be a no-op now when using --sanitize-env, but I didn't merge that for now, because I'm not sure of the potential fallout, given that other infrastructure things might be running dpkg-buildpackage w/o passing --sanitize-env. So I think those would need to be found and changed before deploying that change. <https://git.hadrons.org/cgit/debian/dpkg/dpkg.git/log/?h=next/req-utf8> But then, I guess whether merging that makes sense or not also depends on how we want to proceed with the debian/rules build entry point, and whether we are going to switch the default or transition to amend all callers (which might still not catch private infra and similar). Thanks, Guillem
From 49dd377f0cea0d6a23ef619a2f77b268e3f5e14a Mon Sep 17 00:00:00 2001 From: Guillem Jover <guil...@debian.org> Date: Tue, 2 Jul 2024 03:34:35 +0200 Subject: [PATCH] dpkg-buildpackage: Require an UTF-8 locale when building packages For LC_COLLATE we also require C.UTF-8 or POSIX.UTF-8 to guarantee a stable sorting during builds, w/o disrupting other localization outputs that might be wanted by the person starting the build. Proposed-by: Alexandre Detiste <alexandre.deti...@gmail.com> --- scripts/dpkg-buildpackage.pl | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/scripts/dpkg-buildpackage.pl b/scripts/dpkg-buildpackage.pl index d849d6e90..e400f9da1 100755 --- a/scripts/dpkg-buildpackage.pl +++ b/scripts/dpkg-buildpackage.pl @@ -26,6 +26,7 @@ use warnings; use File::Path qw(remove_tree); use File::Copy; use File::Glob qw(bsd_glob GLOB_TILDE GLOB_NOCHECK); +use I18N::Langinfo qw(langinfo CODESET); use POSIX qw(:sys_wait_h); use Dpkg (); @@ -643,6 +644,21 @@ if ($sanitize_env) { run_vendor_hook('sanitize-environment'); } +# If LC_ALL is set, the codeset is coming from that. If it is not set, +# then it either comes from LC_CTYPE or LANG. We can ignore the LANGUAGE +# GNU extension, as that only overrides LC_MESSAGES which takes the codeset +# from LC_CTYPE or LANG anyway. +my $codeset = langinfo(CODESET); +if ($codeset ne 'UTF-8') { + error(g_('requires a locale with a UTF-8 codeset')); +} +# But we also need to check whether LC_COLLATE has an expected value. +my $lc_collate = $ENV{LC_ALL} || $ENV{LC_COLLATE} || $ENV{LANG} || q{}; +if ($lc_collate ne 'C.UTF-8' && $lc_collate ne 'POSIX.UTF-8') { + error(g_('LC_COLLATE="%s" needs to be C.UTF-8 (or POSIX.UTF-8)'), + $lc_collate); +} + my $build_driver = Dpkg::BuildDriver->new( ctrl => $ctrl, debian_rules => \@debian_rules, -- 2.45.2