On Sun, Mar 31, 2024, at 3:17 AM, Jacob Bachmeyer wrote:
> Eric Gallager wrote:
>> Specifically, what caught my attention was how the release tarball
>> containing the backdoor didn't match the history of the project in its
>> git repository. That made me think about automake's `distcheck`
>> target, whose entire purpose is to make it easier to verify that a
>> distribution tarball can be rebuilt from itself and contains all the
>> things it ought to contain.
>
> The problem is that a release tarball is a freestanding object, with no
> dependency on the repository from which it was produced. In this case,
> the attacker added a bogus "update" of build-to-host.m4 from gnulib to
> the release tarball, but that file is not stored in the Git repository.
> This would not have tripped "make distcheck" because the crocked tarball
> can indeed be used to rebuild another crocked tarball.
>
> As Alexandre Oliva mentioned in his reply, there is not really any good
> way to prevent this, since the attacker could also patch the generated
> configure script more directly.

I have been thinking about this incident and this thread all weekend
and have seen a lot of people saying things like "this is more proof
that tarballs are a thing of the past and everyone should just build
straight from git". There are a bunch of reasons why one might
disagree with this as a blanket statement, but I do think there's a
valid point here: the malicious xz maintainer *might* have been caught
earlier if they had committed the build-to-host.m4 modification to
xz's VCS. (Or they might not have! Witness the three (and counting)
malicious patches that they barefacedly submitted to *other* software
and got accepted because the malice was subtle enough to pass through
code review.)

It might indeed be worth thinking about ways to minimize the
difference between the tarball "make dist" produces and the tarball
"git archive" produces, starting from the same clean git checkout,
and also ways to identify and audit those differences.

...

> Maybe the best revision to the GNU Coding Standards would be that
> releases should, if at all possible, contain only text? Any binary
> files needed for testing can be generated during "make check" if
> necessary

I don't think this is a good idea. It's only a speed bump for someone
trying to smuggle malicious data into a package (think "base64 -d")
and it makes life substantially harder for honest authors of programs
that work with binary data, and authors of material whose "source
code" (as GPLv3 uses that term) *is* binary data. Consider pngsuite,
for instance (http://www.schaik.com/pngsuite/) -- it would be a *ton*
of work to convert each of these test PNG files into GNU Poke scripts,
and probably the result would be *less* ergonomic for purposes of
improving the test suite.

I would like to suggest that a more useful policy would be "files
written to $prefix by 'make install' should not have any data
dependency on files labeled as part of the package's testsuite".
That doesn't constrain honest authors and it seems within the scope of
what the reproducible builds people could test for. (Build the
package, install to nonce prefix 1, unpack the tarball again, delete
the test suite, build again, install to prefix 2, compare.) Of course
a sufficiently determined malicious coder could detect the
reproducible-build test environment, but unlike "no binary data" this
is a substantial difficulty increment.

zw
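
The final "compare" step of the two-prefix test could look roughly like
the sketch below. It is only an illustration: the function names
tree_digest and compare_prefixes are made up here, and it assumes both
builds have already been installed into two separate staging
directories. A real check would probably configure both builds with the
same --prefix and stage them through DESTDIR, so that paths embedded in
installed files do not show up as spurious differences.

```python
import hashlib
import os


def tree_digest(root):
    """Map each file's path (relative to root) to a digest of its contents.

    Symlinks are recorded by their target rather than followed.
    """
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            if os.path.islink(full):
                digests[rel] = "symlink:" + os.readlink(full)
            else:
                with open(full, "rb") as fh:
                    digests[rel] = hashlib.sha256(fh.read()).hexdigest()
    return digests


def compare_prefixes(prefix1, prefix2):
    """Return (only in prefix1, only in prefix2, present in both but different)."""
    a, b = tree_digest(prefix1), tree_digest(prefix2)
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    changed = sorted(p for p in set(a) & set(b) if a[p] != b[p])
    return only_a, only_b, changed
```

If all three returned lists are empty, deleting the test suite made no
difference to what "make install" wrote. Anything else is a file worth
looking at, though not necessarily a malicious one.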
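
For the idea of identifying and auditing the differences between the
"make dist" tarball and the "git archive" tarball, a first pass might
simply list the files that exist only in the release tarball. The
sketch below is a hedged illustration, not an existing tool:
member_paths and release_only_files are hypothetical names, and it
assumes both archives wrap their contents in a single top-level
directory (as "make dist" does, and as "git archive --prefix=NAME/"
can be told to do).

```python
import tarfile


def member_paths(tar_path):
    """Return regular-file paths with the top-level directory stripped."""
    paths = set()
    with tarfile.open(tar_path) as tar:  # compression is auto-detected
        for member in tar.getmembers():
            if not member.isfile():
                continue
            parts = member.name.split("/", 1)
            if len(parts) == 2 and parts[1]:
                paths.add(parts[1])
    return paths


def release_only_files(dist_tarball, vcs_tarball):
    """List files shipped in the release tarball but absent from the VCS export."""
    return sorted(member_paths(dist_tarball) - member_paths(vcs_tarball))
```

For an autotools package the list will always contain legitimately
generated files such as configure and the Makefile.in files, so the
output is a starting point for an audit rather than a verdict. It also
compares paths only; files present in both archives would need a
separate content comparison.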