Hi!

On Fri, 2024-03-29 at 23:53:20 -0600, Antonio Russo wrote:
> On 2024-03-29 22:41, Guillem Jover wrote:
> > On Fri, 2024-03-29 at 18:21:27 -0600, Antonio Russo wrote:
> >> Had tooling existed in Debian to automatically validate this faithful
> >> reproduction, we might not have been exposed to this issue.
> >
> > Given that the autogenerated stuff is not present in the git tree,
> > a diff between tarball and git would always generate tons of delta,
> > so this would not have helped.
>
> I may not have been clear, but I'm suggesting scrubbing all the
> autogenerated stuff, and comparing that against a similarly scrubbed
> git tag's contents. (But you explain that this is problematic.)
Yes, the point here is how we determine what the autogenerated stuff
is when confronted with a malicious upstream; the problem, again, is
that if you need to verify everything, then you might easily get
overwhelmed by the sheer amount of autogenerated output. But see
below.

> >> Having done this myself, it has been my experience that many partial
> >> build artifacts are captured in source tarballs that are not otherwise
> >> maintained in the git repository. For instance, in zfs (which I have
> >> contributed to in the past), many automake files are regenerated.
> >> (I do not believe that specific package is vulnerable to an attack
> >> on the autoconf/automake files, since the debian package calls the
> >> upstream tooling to regenerate those files.)
>
> (Hopefully the above clears up that I at least have some superficial
> awareness of the build artifacts showing up in the release tarball!)

(Sorry, I guess my reply might have sounded patronizing? I noticed
later on that you had explicitly mentioned this, but thought that
would be clear when reading the whole mail; I considered adding a
note to the earlier text, but deemed it unnecessary. I should
probably have added it anyway. :)

> >> 1. Move towards allowing, and then favoring, git-tags over source tarballs
> >
> > I assume you mean git archives out of git tags? Otherwise how do you
> > go from git-tag to a source package in your mind?
>
> I'm not wedded to any specific mechanism, but I'd be content with that.
> I'd be most happy with DD-signed tags that were certified DFSG and
> policy compliant (i.e., lacking build artifacts), and equivalent to
> scrubbed upstream source. (And more on that later, building on what
> you say.)
>
> Many repositories today already do things close to this with pristine-tar,
> so this seems to me a direction where the tooling already exists.
>
> I'll add that, if we drop the desire for a signed archive, and instead
> require a signed git-tag (from which we can generate a source tar on
> demand, as you suggest), we can drop the pristine-tar requirement. If we
> are less progressive, and move to working exclusively with
> Debian-regenerated .tar files, we can probably avoid many of the
> frustrating edge cases that pristine-tar still struggles with.

I'm personally not a fan of pristine-tar, and my impression is that
it is falling out of favor in various corners and big teams within
the project. I'm also not a fan of mixing packaging with upstream git
history. The non-native packages I maintain contain only debian/
directories, which to me have superior properties (but not the
tooling), including in a situation like this. I'll expand on this
later.

I've been thinking that perhaps the only thing we'd need is to
include either a file, or a field in some file, that refers to the
upstream commit we think the tarball is derived from. We already have
fields that contain the upstream VCS repo. Then we could also have
tooling that performs such checks, independently of how we transport
and pack our sources. (I've sketched what such a check could look
like in a postscript below.)

> >> 2. Require upstream-built artifacts be removed (instead, generate these
> >> ab-initio during build)
> >
> > The problem here is that the .m4 file to hook into the build system was
> > named like one shipped by gnulib (so less suspicious), but xz-utils does
> > not use gnulib, and thus the autotools machinery does not know anything
> > about it, so even the «autoreconf -f -i» done by debhelper via
> > dh-autoreconf would not regenerate it.
>
> The way I see it, there are two options in handling a buildable package:
>
> 1. That file would have been considered a build artifact, consequently
> removed and then regenerated. No backdoor.
>
> 2. The file would not have been scrubbed, and a difference between the
> git version and the released tar version would have been noticed.
> Backdoor found.
>
> Either of these is, in my mind, dramatically better than what happened.

Sure, but that relies on knowing for certain what is and what is not
autogenerated for 1), on not drowning in autogenerated output for 2)
so that nothing can be easily missed, and on autoreconf doing what we
expect!

Also important is when this would be done: only on initial packaging,
or on every build? Because 1) has the bad property that the file
might get removed during the initial packaging inspection, but then
stay latent until surrounding conditions activate it again, if we are
not continuously performing that kind of check (say, on every build
of the package).

> One automatic approach would be to run dh-autoreconf and identify the
> changed files. Remove those files from both the distributed tarball and
> git tag. Check if those differ. (You also suggest something very similar
> to this, and repacking the archive with those Debian-generated build
> artifacts.)
> I may be missing something here, though!

In theory this would be an option; I'm not sure how feasible it is in
practice, though. :/ At least as of now.

> > Removing these might be cumbersome after the fact if upstream includes
> > for example their own maintained .m4 files. See dpkg's m4 dir for an
> > example of this (although there it's easy as all are namespaced but…).
>
> I am not an m4 expert (in fact, I have specifically tried to avoid
> learning anything more about auto(make/reconf) than absolutely
> necessary).
>
> My point is just: either those files are needed, or not. If they're
> needed, they need to not differ. And if they're not, they should
> be scrubbed.
>
> I think you are saying that doing this automatically is going to be
> hard/impossible. Is that fair?

Let's go into detail on how this was done on the build system side
(I'm doing this right now, as previously I had only skimmed over the
process).

The build system hook was planted in the tarball by adding a modified
m4/build-to-host.m4 file. This file originally comes from gnulib (but
gettext would usually embed it if it required it). The macros
contained within are used by m4/gettext.m4, which comes from gettext.
So to start with, this dependency (the AM_GNU_GETTEXT macro uses
gl_BUILD_TO_HOST) is only present with newer gettext versions. The
tarball was autoreconf'ed with gettext 0.22.4, while Debian has
gettext 0.21, which does not pull that dependency in. In that case,
if gettext.m4 got refreshed during the build now, the hook would be
inert, but once we update to a newer gettext it would get activated
again.

The m4/build-to-host.m4 file, in addition to hooking the payload into
the build system, also got its serial number bumped from 3 to 30. And
the bigger issue is that «autoreconf -f -i» does not even refresh the
files (as you'd expect from --force) if the .m4 serial is higher. So
in Debian currently, the gettext.m4 in the tarball does not get
refreshed (still pulling in the malicious build-to-host.m4, which
would not happen with the gettext version in Debian), and if we
updated to a newer gettext it would not update build-to-host.m4
anyway, due to its bumped serial.
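(As an aside, this serial behavior at least makes one red flag
mechanically checkable: a vendored .m4 file whose serial is higher
than the copy shipped with the system tools is exactly a file that
«autoreconf -f -i» will silently keep. A rough sketch, where the
system macro location is illustrative only, could be:

  # Flag vendored macros claiming a newer serial than the system copy;
  # these are the ones «autoreconf -f -i» will not refresh.
  for f in m4/*.m4; do
    sys="/usr/share/aclocal/$(basename "$f")"   # illustrative path
    [ -e "$sys" ] || continue
    v=$(sed -n 's/^#[[:space:]]*serial[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$f" | head -n1)
    s=$(sed -n 's/^#[[:space:]]*serial[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$sys" | head -n1)
    if [ -n "$v" ] && [ -n "$s" ] && [ "$v" -gt "$s" ]; then
      echo "$f: serial $v > system serial $s"
    fi
  done

In this case that would have flagged m4/build-to-host.m4 with serial
30 against serial 3, although such a check only helps when a pristine
copy actually exists somewhere to compare against.)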
This seems like a serious bug in autoreconf, but I've not checked
whether this has been brought up upstream, and whether they consider
it to be working as intended. I'd expect the serial to be honored
only when not in --force mode, though. :/

The other side of the coin is that the malicious actor added
precisely that .m4 file to the project's .gitignore file (as you'd
usually expect, given the new gettext version pulling it in), so if
we were ignoring changes based on trusting upstream, this could slip
through if we only used this for checking, unless we repackage or
always clean these files before builds:

  https://git.tukaani.org/?p=xz.git;a=commitdiff;h=4323bc3e0c1e1d2037d5e670a3bf6633e8a3031e

> > Not using an upstream provided tarball might also mean we stop being
> > able to use upstream signatures, which seems worse. The alternative
> > might be promoting for upstreams to just do the equivalent of
> > «git archive», but that might defeat the portability and dependency
> > reduction properties that were designed into the autotools build
> > system, or increase the bootstrap set (see for example the
> > pkg.dpkg.author-release build profile used by dpkg).
>
> Ok, so am I understanding you correctly in that you are saying: we do
> actually want *some* build artifacts in the source archives?

Upstream might want them. And if we repackage out of principle, then
we might lose other properties, such as signatures for code
provenance trails and similar.

> If that's the case, could we make those files at packaging time,
> analogous to the DFSG-exclude stripping process?

Ideally we'd remove all autogenerated files, but I'm not sure how
effective that would be against a determined malicious actor.

> >> In the present case, the triggering modification was in a modified .m4 file
> >> that injected a snippet into the configure script. That modification
> >> could have been flagged using this kind of process.
> >
> > I don't think this modification would have been spotted, because it
> > was not modifying a file that would usually get autogenerated by the
> > build system.
>
> If we look at what ./autogen.sh would have changed, and scrub those
> files from the release archive, wouldn't that mean that the malicious
> m4 file would have been spotted, since it would NOT have been autogenerated?

Not in this case, no: autoreconf would have left both gettext.m4 and
build-to-host.m4 alone. I mean, you are left with those (and other)
files for manual review, and while, like Russ, I think I'm fluent in
m4 and autotools stuff, if I had to skim that file in review, it
would look very non-suspicious to me, TBH.

> >> While this would be a lot of work, I believe doing so would require a
> >> much larger amount of additional complexity in orchestrating attacks
> >> against Debian in the future.
> >
> > It would certainly make it a bit harder, but I'm afraid that if you
> > cannot trust upstream and they are playing a long game, then IMO they
> > can still sneak nasty stuff in even in plain sight with just code
> > commits, unless you are paying extremely close attention. :/
> >
> > See for example <https://en.wikipedia.org/wiki/Underhanded_C_Contest>.
>
> I take a look at these every year or so to keep me terrified of C!
> If it's a single upstream developer, I absolutely agree, but if there's an
> upstream community reviewing the git commits, I really do believe there is
> hope of them identifying bad(tm) things.
>
> I just want to make sure that what we actually pull in is what the
> community is actually reviewing.
> I feel like anything less gets dangerous.
> (Given few enough eyeballs, all bugs are deep!)
>
> But, I will definitely concede that, had I seen a commit that changed
> that line in the m4, there's a good chance my eyes would have glazed
> over it.

I think the biggest issue is that we are pretty much based on a model
that relies on trusting upstreams: for code, for license and
copyright compliance, etc. We tend to assume that upstreams (and we!)
can make mistakes, but that in general they are not working against
us. When confronted with a known hostile (and not necessarily
malicious) upstream, the only winning move is not to play. If we do
not even know that the upstream is hostile and/or malicious, that
seems like a losing prospect to me. There are so many ways such an
upstream can slip stuff through in this model that things get really
nasty really quickly.

Don't get me wrong, I think we can and should modify our processes
and tooling somehow, to at least try to deter this path as much as
possible, but it still seems to run counter to our model, and like a
losing prospect. (You could have an upstream that tries to overwhelm
you with a sheer amount of commits, for example. In this case they
even included the bulk of the backdoor in git, and in the end I guess
I don't see much difference between smuggling something in through
git or through a tarball.)

And, coming back to the Debian side of things: to me the most
important part is that we might be able to close this door a bit with
upstream, but what about this happening within Debian? I think we
have discussed in the past what would happen if someone tried this
kind of long-term attack on the project, and my feeling is that we
have kind of shrugged it off as either "it would take too much effort
so it's implausible" or "if they want to do it we are lost anyway",
but perhaps I'm misremembering.

Related to this, dgit has been brought up as the solution to this,
but in my mind this incident reinforces my view that storing more
upstream stuff in git is precisely the opposite of what we'd want,
and makes reviewing even harder, given that in our context we are on
a permanent fork against upstream, and if you include merge commits
and similar, there are lots of places to hide stuff. In contrast,
storing only the packaging bits (the debian/ dir alone), like pretty
much every other downstream does with their packaging bits, makes for
an obviously more manageable thing to review without drowning in it,
more so if we have to consider that next time the long game might get
played within Debian. :(

(An additional bonus of only keeping debian/ directories is that it
would make it possible to check out all of Debian's packaging
locally. :)

Thanks,
Guillem
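P.S.: To make the commit-reference idea from earlier in this mail a
bit more concrete, here is a very rough sketch of what such tooling
could do. The debian/upstream-commit file is hypothetical, the
repository URL could come from the existing debian/upstream/metadata
Repository field, and «foo_1.0.orig.tar.xz» is a stand-in name:

  # Compare an orig tarball against the upstream commit it claims to
  # be derived from, regenerating the autotools output on both sides
  # first so that it does not drown the diff.
  commit=$(cat debian/upstream-commit)    # hypothetical new file
  repo=$(sed -n 's/^Repository: //p' debian/upstream/metadata)

  git clone --bare "$repo" upstream.git
  mkdir git tarball
  git -C upstream.git archive "$commit" | tar -x -C git
  tar -x -C tarball --strip-components=1 -f foo_1.0.orig.tar.xz

  for d in git tarball; do
    (cd "$d" && autoreconf -f -i >/dev/null 2>&1 || true)
  done
  diff -urN --exclude=autom4te.cache git tarball

Of course, as discussed above, the autoreconf step cannot currently
be trusted to refresh everything (the serial issue), so any remaining
delta still needs actual eyes on it.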