All,

(Going out to both debian-devel and bug-gnulib. Please be respectful of each community's different perspectives and trim Cc when focus shifts to any Debian or gnulib specific topics. Please pardon the accidental duplicate post to bug-gnulib.)

The content of upstream source code releases can largely be categorized into 1) the actual native source code from the upstream supplier, 2) pre-generated artifacts from build tools (e.g., the ./configure script) and 3) third-party maintained source code (e.g., config.guess or getopt.c). The files in 3) may be referred to as "vendored" code. The habit of including vendored and pre-generated artifacts is a powerful and successful way to make release tarballs usable for users, going back to the 1980s.

This habit poses some challenges for packaging though:

1) Pre-generated files (e.g., ./configure) should be re-generated to make sure the package is built from source code, and to allow patches to the toolchain used to generate the pre-generated files to have any effect. Otherwise we risk using pre-generated files created using non-free or non-public tools, which if I understand correctly is against Debian main policy. Verifying that this happens for all pre-generated files in an upstream tarball is complicated, fragile and tedious work. I think it is simple to find examples of mistakes in this area even for important top-popcon Debian packages. The current approach of running autoreconf -fi is based on a misunderstanding: autoreconf -fi is documented to not replace certain files with newer versions:
   https://lists.nongnu.org/archive/html/bug-gnulib/2024-04/msg00052.html

2) If a security problem in vendored code is discovered, the security team may have to patch 50+ packages if the vendor origin is popular. Maybe even different versions of the same vendored code have to be patched.

3) Auditing the difference between the tarball and what is stored in the upstream version control system (VCS) is challenging. The xz incident exploited the fact that some pre-generated files aren't included in upstream VCS. Some upstreams react to this by adding all pre-generated artifacts to VCS -- OpenSSH seems to take the route of adding the generated ./configure script to git, which moves that file from 2) to 1), but the problem remains.

4) Auditing for license compliance is challenging, since not only do we have to audit all of upstream's code but we also have to audit the license of pre-generated files and vendored source code.

There are probably more problems involved, and probably better ways to articulate the problems than what I managed to do above.

The Go and Rust ecosystems solve some of these issues, which has other consequences for packaging. We have largely ignored that the same challenges apply to many C packages, and I'm focusing on those that use gnulib -- https://www.gnu.org/software/gnulib/ -- gzip, tar, grep, m4, sed, bison, awk, coreutils, grub, libiconv, libtasn1, libidn2, inetutils, etc:
https://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=users.txt

Solving all of the problems for all packages will require some work and will take time. I've started to see if we can make progress on the gnulib-related packages. I'm speaking as a contributor to gnulib and maintainer of a couple of Debian packages, but still learning to navigate -- the purpose of this post is to describe what I've done for libntlm and ask for feedback to hopefully make this into a re-usable pattern that can be applied to more packages. It would be great to improve collaboration on these topics between GNU and Debian.

So let's turn this post into a recipe for Debian maintainers of packages that use gnulib to follow for their packages. I'm assuming git from now on, but feel free to mentally s/git/$VCS/.

The first step is to establish an upstream tarball that you want to work with. There are too many opinions floating around on this to make any single solution a pre-requisite, so here are the different approaches I can identify, ordered by my own preference, and the considerations with each.

1) Use upstream's PGP signed git-archive tarball. See my recent blog posts for this new approach. The key property here is that there is no need to audit the difference between the upstream tarball and upstream git.
   https://blog.josefsson.org/2024/04/01/towards-reproducible-minimal-source-code-tarballs-please-welcome-src-tar-gz/
   https://blog.josefsson.org/2024/04/13/reproducible-and-minimal-source-only-tarballs/

2) Use upstream's PGP signed tarball. This is currently the most common and recommended approach, as far as I know.

3) Create a PGP signed git-archive tarball. If upstream doesn't publish PGP signed tarballs, or if there is a preference from upstream or from you as Debian package maintainer to not do 1) or 2), then create a minimal source-only copy of the git archive and sign it yourself. This could be done with something like this:

     git clone https://git.savannah.gnu.org/git/inetutils.git
     cd inetutils/
     git archive --prefix=inetutils-v2.5/ -o inetutils-2.5-src.tar.gz v2.5
     # additional filtering of tarball may go here
     gpg -b inetutils-2.5-src.tar.gz

   This is your new upstream tarball. To build this particular one, use ./bootstrap --no-git --gnulib-srcdir=/usr/share/gnulib.

4) Use upstream's git-archive tarball and PGP sign it. Download it using the GitHub or GitLab download link on the git tag like the cool kids. If you did this on a sunny day, the downloaded tarball should be identical to the git-archive tarball and you can sign it if you are comfortable with this.

5) Use upstream's git-archive tarball. For those who want to join the really cool kids club.

6) Use upstream's tarball without PGP signature. This is quite common today. It happens when upstream doesn't publish PGP signatures or the Debian maintainer doesn't care about them.

Regardless of mechanism, you should end up with a tarball that we call the "upstream tarball". Which approach is chosen is subjective and up to the Debian package maintainer. People have different opinions.
While I can't hide my own preferences, I think we have to acknowledge that there is no single uniform answer here.

To reach the goals stated at the beginning of this post, this upstream tarball has to be filtered to remove all pre-generated artifacts and vendored code. Use some mechanism, like the debian/copyright Files-Excluded mechanism, to remove them. If you used a git-archive upstream tarball, chances are higher that you won't have to do a lot of work, especially for pre-generated scripts.

This filtered tarball will be the *.orig.tar.gz used to build the Debian package. Ideally you would like the *.orig.tar.gz tarball to be as close as possible to upstream's git repository for the tag release, minus any pre-generated scripts or vendored gnulib files that upstream put into git. For collaborative upstreams, you could try to convince them to not put pre-generated scripts and vendored gnulib files into git.

Auditing the upstream tarball against the *.orig.tar.gz should be simple: use sha256sum or diffoscope to compare content. In some ideal world these could be bit-by-bit identical. I'm hoping this can be the new best recommended approach going forward. This is only possible when upstream agrees with these concerns and makes an effort to publish such minimized source-only tarballs. This may be a pipe dream, just like Debian's current best recommended approach of using upstream PGP signed tarballs is sometimes ignored.

You will now be faced with the challenge of building this tarball. Your existing debian/rules makefile will not work any more since it assumed the existence of the pre-generated scripts and vendored gnulib files. So you have to add the required tools as Build-Depends: and update debian/rules to build everything from source code.

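
The audit of the upstream tarball against the *.orig.tar.gz mentioned above could be sketched roughly as follows. This is only an illustration, not something the libntlm packaging uses: the function names tarball_manifest and compare_tarballs are made up for this example, GNU tar and coreutils are assumed, and both tarballs are assumed to unpack into a single top-level directory. diffoscope would give a far more detailed report.

```shell
# Print "sha256  ./relative/path" for every regular file in a tarball,
# sorted by path.  Assumes one top-level directory, which is stripped.
tarball_manifest() {
    tmp=$(mktemp -d)
    tar -xf "$1" -C "$tmp" --strip-components=1
    (cd "$tmp" && find . -type f -exec sha256sum {} + | LC_ALL=C sort -k2)
    rm -rf "$tmp"
}

# Print the manifest lines that differ between two tarballs.
# Lines starting with "<" exist only in the first tarball (or differ there),
# lines starting with ">" only in the second.
compare_tarballs() {
    a=$(mktemp)
    b=$(mktemp)
    tarball_manifest "$1" > "$a"
    tarball_manifest "$2" > "$b"
    diff "$a" "$b" || true
    rm -f "$a" "$b"
}
```

Calling compare_tarballs with the upstream tarball and the filtered *.orig.tar.gz as arguments should then list only the files that were deliberately excluded, such as ./configure or vendored gnulib files; anything else in the output deserves a closer look.
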
For libntlm the essential diff between version 1.7-1, which used the upstream tarball with pre-generated content and gnulib code, and the latest version 1.8-3, which builds from a minimal source-only tarball, is small:

--- a/debian/control
+++ b/debian/control
@@ -6,6 +6,8 @@ Uploaders: Simon Josefsson <si...@josefsson.org>,
 Build-Depends:
  debhelper-compat (= 13),
+ git,
+ gnulib (>= 20240412~dfb7117+stable202401.20240408~aa0aa87-3~),
 Standards-Version: 4.6.2
 Section: libs
 Homepage: https://www.nongnu.org/libntlm/
--- a/debian/rules
+++ b/debian/rules
@@ -1,6 +1,16 @@
 #! /usr/bin/make -f
 
+include /usr/share/gnulib/debian/gnulib-dpkg.mk
+
 export DEB_BUILD_MAINT_OPTIONS = hardening=+all
 
 %:
-	dh $@ --builddirectory=build -X.la
+	dh $@ --without autoreconf --builddirectory=build
+
+pull:
+	./bootstrap --gnulib-srcdir=$(GNULIB_DEB_DEBIAN_GNULIB) --pull
+
+gen:
+	./bootstrap --gnulib-srcdir=$(GNULIB_DEB_DEBIAN_GNULIB) --gen
+
+execute_before_dh_auto_configure: dh_gnulib_clone pull dh_gnulib_patch gen

As you can see, the essential part is to add a Build-Depends on the gnulib Debian package to get the necessary gnulib code for building. We also disable dh_autoreconf since its approach is no longer necessary (and hides problems); everything is built from source coming from Debian or upstream.

There is one design aspect of gnulib that is important to understand: gnulib is a source-only library, is not versioned and has no release tarballs. Its release artifact is the git repository containing all the commits. Packages like coreutils, gzip, tar etc pin to one particular commit of gnulib. There is little coordination among packages about which gnulib git commit to use, and historically they typically use the latest gnulib git commit that was published when the release manager prepared a release. Usually the pinning happens through a git submodule or through the GNULIB_REVISION bootstrap.conf mechanism, but there is no requirement from gnulib on this.
This method will vary between packages that use gnulib, and it is not necessary to enforce a particular style.

The gnulib package since version 20240412~dfb7117+stable202401.20240408~aa0aa87-3 -- and YES we will change that horrible version number, see https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1069268 -- provides the gnulib git repository as a git bundle in /usr/share/gnulib/gnulib.bundle, and two mechanisms to help you as maintainer of a gnulib-consuming package:

1) /usr/share/gnulib/debian/gnulib-dpkg.mk
   https://salsa.debian.org/debian/gnulib/-/blob/f777347e10cbfa1c6dcfd9df7ec98bfb08815f14/debian/local/gnulib-dpkg.mk

2) /usr/bin/dh_gnulib_patch
   https://manpages.debian.org/unstable/gnulib/dh_gnulib_patch.1.en.html
   https://salsa.debian.org/debian/gnulib/-/blob/f777347e10cbfa1c6dcfd9df7ec98bfb08815f14/debian/local/dh_gnulib_patch

Those links provide useful additional discussion, please read them.

The gnulib-dpkg.mk file is intended to be included from your debian/rules file to provide some opt-in gnulib-related definitions. Some of them are used in libntlm's debian/rules above.

The dh_gnulib_patch tool is intended to patch gnulib code before it is used to build your package. This dh_gnulib_patch approach is the only reason for having to split up the build process in debian/rules into two steps above. Why is that complication necessary?

To understand this, consider what happens if there is a security vulnerability in gnulib. We could have 50+ packages in Debian that use gnulib and import the vulnerable code. It would be nice to patch the gnulib package and rebuild the dependent packages and things would be well. But we can't do that. Packages pin to different gnulib git commits. You can't force all packages to pin to the same gnulib git commit and modify only that commit. So we need a mechanism that can patch ALL commits of gnulib that contain the vulnerable code, and do that from the gnulib package. That is what dh_gnulib_patch does.
Then the bug fixes to gnulib code can be centralized into the package in which they originate: gnulib.

I'm hoping by now that any maintainer of a Debian package that uses gnulib will be able to experiment and see if they are able to build their package using this approach. I have not made any attempts to do this myself. I'm sure there are some complicated problems that remain. I am aware that git-version-gen produces an UNKNOWN version in a git-archive tarball situation. That could be worked around in Debian or fixed upstream in gnulib with a git export-subst approach inspired by this:
https://github.com/ansemjo/version.sh

What do you think of this approach? Are there any fundamental design issues or bad assumptions? I'd like to see people think about that.

Besides git-version-gen there is another known open issue: while the gnulib package does not have any packages in Depends, did you notice the Build-Depends on git that we added to libntlm?

That Build-Depends: git may be a cross-building and bootstrappability problem: in order to build libntlm you will need to have git available on your target platform. Now fortunately git doesn't depend on libntlm, but git has a non-minimal Depends list that may be troublesome. Among the indirect dependencies of git are GnuTLS and libtasn1, which both use gnulib.

Is the assumption that 'git' is available on a new target platform a real problem? I would assume that (some stripped down version of) git is a requirement to do any useful work on any platform these days, so maybe it isn't a problem and you can fake this Depends to cross/bootstrap build libntlm (and any gnulib-using package) anyway.

IF the dependency on git is a significant problem, we can come up with other solutions. The reason for having the dependency on git in the first place is that all gnulib-consuming packages pin to a particular gnulib git commit, and to extract that gnulib git commit, we currently use the git tool on the gnulib bundle.
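
To illustrate why git is needed here, extracting one pinned commit from a git bundle can be sketched with plain git commands like this. This is a minimal sketch and not necessarily how dh_gnulib_clone is implemented; the function name checkout_from_bundle is made up for this example.

```shell
# Unpack one pinned commit from a git bundle into a directory.
# Usage: checkout_from_bundle BUNDLE COMMIT DESTDIR
# The commit must be reachable from a ref stored in the bundle.
checkout_from_bundle() {
    # A bundle file can be cloned like any other git remote.
    git clone --quiet --no-checkout "$1" "$3" &&
    # Check out the pinned commit without creating a branch.
    git -C "$3" checkout --quiet --detach "$2"
}
```

With the gnulib package installed, something like checkout_from_bundle /usr/share/gnulib/gnulib.bundle followed by the pinned commit id and a target directory would give an unpacked gnulib tree at that commit, assuming the commit is contained in the bundle.
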
But that isn't the only design that is possible. Consider the naive approach of simply shipping an unpacked gnulib directory in /usr/share/gnulib/aacceb6eff58eba91290d930ea9b8275699057cf/ for each and every gnulib git commit. This is not practical since gnulib is huge. However, some other tool than 'git' could be designed to extract gnulib git commits from a git bundle, and this stripped down tool could be used instead. Maybe this is already a concern, and it makes sense to work on a package like 'git-minimal' that does have a minimal Depends: list?

Finally, while this is somewhat gnulib specific, I think the practice goes beyond gnulib, and that the two-phase ./bootstrap approach is a generic re-usable interface that applies to many packages. Could this be generalized into a 'dh_bootstrap' sequence that replaces the 'dh_autoreconf' sequence? That would simplify libntlm's debian/rules a bit further. Having more experiments with these concepts would be nice, to see if this is feasible.

This hypothetical dh_bootstrap sequence would assume that the *.orig.tar.gz is minimized and does not contain any pre-generated files or vendored files, and that it has to rebuild them from scratch. This is different from the dh_autoreconf approach, which assumes the tarball contains pre-generated scripts and vendored code that should be re-generated. My opinion is that trying to re-bootstrap an existing tarball in this way is fragile and invites problems like the xz incident. It seems nicer to build from pure source code and not have to audit pre-generated scripts and vendored files in Debian's *.orig.tar.gz's.

/Simon