Hi!

On Fri, 2024-03-29 at 23:53:20 -0600, Antonio Russo wrote:
> On 2024-03-29 22:41, Guillem Jover wrote:
> > On Fri, 2024-03-29 at 18:21:27 -0600, Antonio Russo wrote:
> >> Had tooling existed in Debian to automatically validate this faithful
> >> reproduction, we might not have been exposed to this issue.
> >
> > Given that the autogenerated stuff is not present in the git tree,
> > a diff between tarball and git would always generate tons of delta,
> > so this would not have helped.
>
> I may not have been clear, but I'm suggesting scrubbing all the
> autogenerated stuff, and comparing that against a similarly scrubbed
> git tag's contents. (But you explain that this is problematic.)
Yes, the point here is how we determine what the autogenerated stuff
is when confronted with a malicious upstream; the problem, again, is
that if you need to verify everything, then you might easily get
overwhelmed by the sheer amount of autogenerated output. But see
below.

> >> Having done this myself, it has been my experience that many partial
> >> build artifacts are captured in source tarballs that are not otherwise
> >> maintained in the git repository. For instance, in zfs (which I have
> >> contributed to in the past), many automake files are regenerated.
> >> (I do not believe that specific package is vulnerable to an attack
> >> on the autoconf/automake files, since the debian package calls the
> >> upstream tooling to regenerate those files.)
>
> (Hopefully the above clears up that I at least have some superficial
> awareness of the build artifacts showing up in the release tarball!)

(Sorry, I guess my reply might have sounded patronizing? I noticed
later on that you had explicitly mentioned this, but thought that
would be clear when reading the whole mail; I considered adding a
note to the earlier text, but deemed it unnecessary. I should
probably have added it anyway. :)

> >> 1. Move towards allowing, and then favoring, git-tags over source tarballs
> >
> > I assume you mean git archives out of git tags? Otherwise how do you
> > go from git-tag to a source package in your mind?
>
> I'm not wedded to any specific mechanism, but I'd be content with that.
> I'd be most happy with DD-signed tags that were certified DFSG and
> policy compliant (i.e., lacking build artifacts), and equivalent to
> scrubbed upstream source. (And more on that later, building on what
> you say.)
>
> Many repositories today already do things close to this with pristine-tar,
> so this seems to me a direction where the tooling already exists.
>
> I'll add that, if we drop the desire for a signed archive, and instead
> require a signed git-tag (from which we can generate a source tar on
> demand, as you suggest), we can drop the pristine-tar requirement. If we
> are less progressive, and move to working exclusively with
> Debian-regenerated .tar files, we can probably avoid many of the
> frustrating edge cases that pristine-tar still struggles with.

I'm personally not a fan of pristine-tar, and my impression is that
it is falling out of favor in various corners and big teams within
the project. I'm also not a fan of mixing packaging with upstream git
history. The non-native packages I maintain contain only debian/
directories, which to me have superior properties (but not the
tooling), including in a situation like this. I'll expand on this
later.

I've been thinking that perhaps the only thing we'd need is to
include either a file, or a field in some file, that refers to the
upstream commit we think the tarball is derived from. We already have
fields that contain the upstream VCS repo. Then we could also have
tooling that performs such checks, independently of how we transport
and pack our sources. (I've sketched what such a check could look
like in a postscript below.)

> >> 2. Require upstream-built artifacts be removed (instead, generate these
> >> ab-initio during build)
> >
> > The problem here is that the .m4 file to hook into the build system was
> > named like one shipped by gnulib (so less suspicious), but xz-utils does
> > not use gnulib, and thus the autotools machinery does not know anything
> > about it, so even the «autoreconf -f -i» done by debhelper via
> > dh-autoreconf would not regenerate it.
>
> The way I see it, there are two options in handling a buildable package:
>
> 1. That file would have been considered a build artifact, consequently
> removed and then regenerated. No backdoor.
>
> 2. The file would not have been scrubbed, and a difference between the
> git version and the released tar version would have been noticed.
> Backdoor found.
>
> Either of these is, in my mind, dramatically better than what happened.

Sure, but that relies on knowing for certain what is and what is not
autogenerated for 1), on not drowning in autogenerated output for 2)
so that nothing can be easily missed, and on autoreconf doing what we
expect!

Also important is when this would be done: only on initial packaging,
or on every build? Because 1) has the bad property that the file
might get removed during the initial packaging inspection, but then
stay latent until surrounding conditions activate it again, if we are
not continuously performing that kind of check (say, on every build
of the package).

> One automatic approach would be to run dh-autoreconf and identify the
> changed files. Remove those files from both the distributed tarball and
> git tag. Check if those differ. (You also suggest something very similar
> to this, and repacking the archive with those Debian-generated build
> artifacts.)
> I may be missing something here, though!

In theory this would be an option; I'm not sure how feasible it is in
practice, though. :/ At least as of now.

> > Removing these might be cumbersome after the fact if upstream includes
> > for example their own maintained .m4 files. See dpkg's m4 dir for an
> > example of this (although there it's easy as all are namespaced but…).
>
> I am not an m4 expert (in fact, I have specifically tried to avoid
> learning anything more about auto(make/reconf) than absolutely
> necessary).
>
> My point is just: either those files are needed, or not. If they're
> needed, they need to not differ. And if they're not, they should
> be scrubbed.
>
> I think you are saying that doing this automatically is going to be
> hard/impossible. Is that fair?

Let's go into detail on how this was done on the build system side
(I'm doing this right now, as previously I had only skimmed over the
process).

The build system hook was planted in the tarball by adding a modified
m4/build-to-host.m4 file. This file originally comes from gnulib (but
gettext would usually embed it if it required it). The macros
contained within are used by m4/gettext.m4, which comes from gettext.
So to start with, this dependency (the AM_GNU_GETTEXT macro uses
gl_BUILD_TO_HOST) is only present with newer gettext versions. The
tarball was autoreconf'ed with gettext 0.22.4, while Debian has
gettext 0.21, which does not pull that dependency in. In that case,
if gettext.m4 got refreshed during the build now, the hook would be
inert, but once we update to a newer gettext it would get activated
again.

The m4/build-to-host.m4 file, in addition to hooking the payload into
the build system, also got its serial number bumped from 3 to 30. And
the bigger issue is that «autoreconf -f -i» does not even refresh the
files (as you'd expect from --force) if the .m4 serial is higher. So
in Debian currently, the gettext.m4 in the tarball does not get
refreshed (still pulling in the malicious build-to-host.m4, which
would not happen with the gettext version in Debian), and if we
updated to a newer gettext it would not update build-to-host.m4
anyway, due to its bumped serial.
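(As an aside, this serial behavior at least makes one red flag
mechanically checkable: a vendored .m4 file whose serial is higher
than the copy shipped with the system tools is exactly a file that
«autoreconf -f -i» will silently keep. A rough sketch, where the
system macro location is illustrative only, could be:

  # Flag vendored macros claiming a newer serial than the system copy;
  # these are the ones «autoreconf -f -i» will not refresh.
  for f in m4/*.m4; do
    sys="/usr/share/aclocal/$(basename "$f")"   # illustrative path
    [ -e "$sys" ] || continue
    v=$(sed -n 's/^#[[:space:]]*serial[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$f" | head -n1)
    s=$(sed -n 's/^#[[:space:]]*serial[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$sys" | head -n1)
    if [ -n "$v" ] && [ -n "$s" ] && [ "$v" -gt "$s" ]; then
      echo "$f: serial $v > system serial $s"
    fi
  done

In this case that would have flagged m4/build-to-host.m4 with serial
30 against serial 3, although such a check only helps when a pristine
copy actually exists somewhere to compare against.)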
This seems like a serious bug in autoreconf, but I've not checked
whether this has been brought up upstream, and whether they consider
it to be working as intended. I'd expect the serial to be honored
only when not in --force mode, though. :/

The other side of the coin is that the malicious actor added
precisely that .m4 file to the project's .gitignore file (as you'd
usually expect, given the new gettext version pulling it in), so if
we were ignoring changes based on trusting upstream, this could slip
through if we only used this for checking, unless we repackage or
always clean these files before builds:

  https://git.tukaani.org/?p=xz.git;a=commitdiff;h=4323bc3e0c1e1d2037d5e670a3bf6633e8a3031e

> > Not using an upstream provided tarball might also mean we stop being
> > able to use upstream signatures, which seems worse. The alternative
> > might be promoting for upstreams to just do the equivalent of
> > «git archive», but that might defeat the portability and dependency
> > reduction properties that were designed into the autotools build
> > system, or increase the bootstrap set (see for example the
> > pkg.dpkg.author-release build profile used by dpkg).
>
> Ok, so am I understanding you correctly in that you are saying: we do
> actually want *some* build artifacts in the source archives?

Upstream might want them. And if we repackage out of principle, then
we might lose other properties, such as signatures for code
provenance trails and similar.

> If that's the case, could we make those files at packaging time,
> analogous to the DFSG-exclude stripping process?

Ideally we'd remove all autogenerated files, but I'm not sure how
effective that would be against a determined malicious actor.

> >> In the present case, the triggering modification was in a modified .m4 file
> >> that injected a snippet into the configure script. That modification
> >> could have been flagged using this kind of process.
> >
> > I don't think this modification would have been spotted, because it
> > was not modifying a file that would usually get autogenerated by the
> > build system.
>
> If we look at what ./autogen.sh would have changed, and scrub those
> files from the release archive, wouldn't that mean that the malicious
> m4 file would have been spotted, since it would NOT have been autogenerated?

Not in this case, no: autoreconf would have left both gettext.m4 and
build-to-host.m4 alone. I mean, you are left with those (and other)
files for manual review, and while, like Russ, I think I'm fluent in
m4 and autotools stuff, if I had to skim that file in review, it
would look very non-suspicious to me, TBH.

> >> While this would be a lot of work, I believe doing so would require a
> >> much larger amount of additional complexity in orchestrating attacks
> >> against Debian in the future.
> >
> > It would certainly make it a bit harder, but I'm afraid that if you
> > cannot trust upstream and they are playing a long game, then IMO they
> > can still sneak nasty stuff in even in plain sight with just code
> > commits, unless you are paying extremely close attention. :/
> >
> > See for example <https://en.wikipedia.org/wiki/Underhanded_C_Contest>.
>
> I take a look at these every year or so to keep me terrified of C!
> If it's a single upstream developer, I absolutely agree, but if there's an
> upstream community reviewing the git commits, I really do believe there is
> hope of them identifying bad(tm) things.
>
> I just want to make sure that what we actually pull in is what the
> community is actually reviewing.
> I feel like anything less gets dangerous.
> (Given few enough eyeballs, all bugs are deep!)
>
> But, I will definitely concede that, had I seen a commit that changed
> that line in the m4, there's a good chance my eyes would have glazed
> over it.

I think the biggest issue is that we are pretty much based on a model
that relies on trusting upstreams: for code, for license and
copyright compliance, etc. We tend to assume that upstreams (and we!)
can make mistakes, but that in general they are not working against
us. When confronted with a known hostile (and not necessarily
malicious) upstream, the only winning move is not to play. If we do
not even know that the upstream is hostile and/or malicious, that
seems like a losing prospect to me. There are so many ways such an
upstream can slip stuff through in this model that things get really
nasty really quickly.

Don't get me wrong, I think we can and should modify our processes
and tooling somehow, to at least try to deter this path as much as
possible, but it still seems to run counter to our model, and like a
losing prospect. (You could have an upstream that tries to overwhelm
you with a sheer amount of commits, for example. In this case they
even included the bulk of the backdoor in git, and in the end I guess
I don't see much difference between smuggling something in through
git or through a tarball.)

And, coming back to the Debian side of things: to me the most
important part is that we might be able to close this door a bit with
upstream, but what about this happening within Debian? I think we
have discussed in the past what would happen if someone tried this
kind of long-term attack on the project, and my feeling is that we
have kind of shrugged it off as either "it would take too much effort
so it's implausible" or "if they want to do it we are lost anyway",
but perhaps I'm misremembering.

Related to this, dgit has been brought up as the solution to this,
but in my mind this incident reinforces my view that storing more
upstream stuff in git is precisely the opposite of what we'd want,
and makes reviewing even harder, given that in our context we are on
a permanent fork against upstream, and if you include merge commits
and similar, there are lots of places to hide stuff. In contrast,
storing only the packaging bits (the debian/ dir alone), like pretty
much every other downstream does with their packaging bits, makes for
an obviously more manageable thing to review without drowning in it,
more so if we have to consider that next time the long game might get
played within Debian. :(

(An additional bonus of only keeping debian/ directories is that it
would make it possible to check out all of Debian's packaging
locally. :)

Thanks,
Guillem
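P.S.: To make the commit-reference idea from earlier in this mail a
bit more concrete, here is a very rough sketch of what such tooling
could do. The debian/upstream-commit file is hypothetical, the
repository URL could come from the existing debian/upstream/metadata
Repository field, and «foo_1.0.orig.tar.xz» is a stand-in name:

  # Compare an orig tarball against the upstream commit it claims to
  # be derived from, regenerating the autotools output on both sides
  # first so that it does not drown the diff.
  commit=$(cat debian/upstream-commit)    # hypothetical new file
  repo=$(sed -n 's/^Repository: //p' debian/upstream/metadata)

  git clone --bare "$repo" upstream.git
  mkdir git tarball
  git -C upstream.git archive "$commit" | tar -x -C git
  tar -x -C tarball --strip-components=1 -f foo_1.0.orig.tar.xz

  for d in git tarball; do
    (cd "$d" && autoreconf -f -i >/dev/null 2>&1 || true)
  done
  diff -urN --exclude=autom4te.cache git tarball

Of course, as discussed above, the autoreconf step cannot currently
be trusted to refresh everything (the serial issue), so any remaining
delta still needs actual eyes on it.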