Hi!

On Sat, 2024-07-06 at 13:13:48 +0200, Alexandre Detiste wrote:
> Le mar. 2 juil. 2024 à 14:37, Guillem Jover <guil...@debian.org> a écrit :
> > On Tue, 2024-07-02 at 09:52:05 +0100, Simon McVittie wrote:
> > > On Tue, 02 Jul 2024 at 03:47:29 +0200, Guillem Jover wrote:
> > > > On Fri, 2024-06-07 at 15:40:07 +0200, Alexandre Detiste wrote:
> > > > > Maybe a compromise would be to at least mandate some UTF-8 locale.
> > > >
> > > > dpkg-buildpackage: Require an UTF-8 (or ASCII) locale when
> > > > building packages
> > >
> > > Allowing ASCII seems counterproductive: that puts us in the code path
> > > where various tools and runtimes (especially Python) will refuse to
> > > process or output anything outside the 0-127 range, which I believe is
> > > exactly the problem that debhelper aims to solve by using C.UTF-8 for
> > > some categories of package (in particular those that build with Meson).
> > >
> > > To get what Alexandre suggested, we'd need to allow UTF-8 but not allow
> > > ASCII (so for example fr_FR.UTF-8 or C.UTF-8 is fine, but in particular
> > > the C locale is not).
> >
> > But, I guess I can at least unconditionally set LC_CTYPE=C.UTF-8 when
> > using --sanitive-env, right away though.

I did something like that as part of dpkg 1.22.7, with commit:

  
https://git.dpkg.org/cgit/dpkg/dpkg.git/commit/?id=df60765ed4bc6640b788c796dd0c627d7714f807

Which should guarantee a UTF-8 codeset and stable sorting, while
preserving any translated output messages (and other locale settings).

> One thing that could be fixed quite quickly is fixing the few
> remaining official buildd workers that does not yet run with an UTF-8 locale.

Something I realized after adding the above change, is that sbuild has
been running dpkg-buildpackage with --sanitize-env for a while now,
which checking now I was told about at the time, but I either didn't
piece together its consequences or perhaps forgot that the sbuild
package is nowadays used in build daemons (instead of the old fork)
and then forgot. :) (BTW, not blaming josch! I think that change in
sbuild on its own makes sense, I guess I was just not expecting the
option to be used that way, and perhaps its documentation should have
somehow made it more clear. :)

I guess this is both good and "bad". It's good because now all build
daemons will have a guaranteed UTF-8 locale codeset already starting
with Debian trixie, as requested in this thread, and give us a more
uniform build environment. It's "bad" because part of the reason to
add this through a new --sanitize-env option was to make this behavior
and its guarantees opt-in, but if the official Debian builds are using
this, then it's in a way equivalent to having set this by default w/o
the option, but perhaps worse because people running local build will
not have the same environment (although it's going to be easy to
replicate by passing that option, but a bit harder when calling
debian/rules directly which we still support).

I'm not sure the current state is ideal, because we are back to
packages being able to rely on some stuff on build daemons, that are
not guaranteed by default for our supported build entry points, and if
the result of this is that we end up patching all dpkg-buildpackage
callers to pass --sanitize-env, then we could have as well simply
changed the default instead. I think a way forward could be to make
the sanitizing the default, and finally drop debian/rules as a
supported (user) build entry point, I had in mind re-proposing this
already, but the above kind of gives it more urgency, so I'll try to
do that soon.

This also means, I guess, that part of the previous freedom I thought
we had to modify the --sanitize-env behavior is kind of gone now (and
would be gone too if we move its behavior as the default one), and we
should apply similar care as if the default itself was being changed,
because it has the potential to break the archive (via build daemons).
I'm thinking that depending on the changes there, it might be better
to gate them via dpkg-build-api(7) levels. I should also document the
vendor specific behavior in some manual page, as it is currently
listed as unspecified "vendor specific".

> If one is unlucky the build will mysteriously fail.
> 
> Adding  export {LC_ALL|LANG|LC_CTYPE}=C.UTF-8
> in every single d/rules by fear of this seems overkill.

> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1074586
> https://buildd.debian.org/status/package.php?p=xrayutilities

I implemented the attached patch (also on the next/req-utf8 branch) to
force a locale with a UTF-8 codeset, which would be a no-op now when
using --sanitize-env, but I didn't merge that for now, because I'm not
sure of the potential fallout, given that other infrastructure things
might be running dpkg-buildpackage w/o passing --sanitize-env. So I
think those would need to be found and changed before deploying that
change.

  <https://git.hadrons.org/cgit/debian/dpkg/dpkg.git/log/?h=next/req-utf8>

But then, I guess whether merging that makes sense or not also depends
on how we want to proceed with the debian/rules build entry point, and
whether we are going to switch the default or transition to amend all
callers (which might still not catch private infra and similar).

Thanks,
Guillem
From 49dd377f0cea0d6a23ef619a2f77b268e3f5e14a Mon Sep 17 00:00:00 2001
From: Guillem Jover <guil...@debian.org>
Date: Tue, 2 Jul 2024 03:34:35 +0200
Subject: [PATCH] dpkg-buildpackage: Require an UTF-8 locale when building
 packages

For LC_COLLATE we also require C.UTF-8 or POSIX.UTF-8 to guarantee a
stable sorting during builds, w/o disrupting other localization outputs
that might be wanted by the person starting the build.

Proposed-by: Alexandre Detiste <alexandre.deti...@gmail.com>
---
 scripts/dpkg-buildpackage.pl | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/scripts/dpkg-buildpackage.pl b/scripts/dpkg-buildpackage.pl
index d849d6e90..e400f9da1 100755
--- a/scripts/dpkg-buildpackage.pl
+++ b/scripts/dpkg-buildpackage.pl
@@ -26,6 +26,7 @@ use warnings;
 use File::Path qw(remove_tree);
 use File::Copy;
 use File::Glob qw(bsd_glob GLOB_TILDE GLOB_NOCHECK);
+use I18N::Langinfo qw(langinfo CODESET);
 use POSIX qw(:sys_wait_h);
 
 use Dpkg ();
@@ -643,6 +644,21 @@ if ($sanitize_env) {
     run_vendor_hook('sanitize-environment');
 }
 
+# If LC_ALL is set, the codeset is coming from that. If it is not set,
+# then it either comes from LC_CTYPE or LANG. We can ignore the LANGUAGE
+# GNU extension, as that only overrides LC_MESSAGES which takes the codeset
+# from LC_CTYPE or LANG anyway.
+my $codeset = langinfo(CODESET);
+if ($codeset ne 'UTF-8') {
+    error(g_('requires a locale with a UTF-8 codeset'));
+}
+# But we also need to check whether LC_COLLATE has an expected value.
+my $lc_collate = $ENV{LC_ALL} || $ENV{LC_COLLATE} || $ENV{LANG} || q{};
+if ($lc_collate ne 'C.UTF-8' && $lc_collate ne 'POSIX.UTF-8') {
+    error(g_('LC_COLLATE="%s" needs to be C.UTF-8 (or POSIX.UTF-8)'),
+          $lc_collate);
+}
+
 my $build_driver = Dpkg::BuildDriver->new(
     ctrl => $ctrl,
     debian_rules => \@debian_rules,
-- 
2.45.2

Reply via email to