Re: make dist using git archive

2024-01-26 Thread Eli Schwartz
Hello, Meson developer here.


On 1/23/24 4:30 AM, Peter Eisentraut wrote:
> On 22.01.24 21:04, Tristan Partin wrote:
>> I am not really following why we can't use the builtin Meson dist
>> command. The only difference from my testing is it doesn't use a
>> --prefix argument.
> 
> Here are some problems I have identified:
> 
> 1. meson dist internally runs gzip without the -n option.  That makes
> the tar.gz archive include a timestamp, which in turn makes it not
> reproducible.


Well, it uses Python's tarfile module, which uses Python's gzip support
under the hood, but yes, that is true: tarfile doesn't expose this
tunable.
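
For illustration only, here is a minimal sketch of one way to work around
that from the caller's side: open the gzip layer by hand with a pinned
mtime and hand it to tarfile as a plain file object. This is not what
meson dist does; the function name make_tar_gz and the in-memory file
dictionary are hypothetical.

```python
import gzip
import io
import tarfile


def make_tar_gz(files):
    """Build a tar.gz in memory whose gzip header carries no timestamp."""
    out = io.BytesIO()
    # Open the gzip layer ourselves so mtime can be pinned to 0;
    # filename="" also keeps the original-name field out of the header.
    with gzip.GzipFile(filename="", mode="wb", fileobj=out, mtime=0) as gz:
        with tarfile.open(fileobj=gz, mode="w") as tar:
            for name in sorted(files):
                info = tarfile.TarInfo(name)
                info.size = len(files[name])
                info.mtime = 0  # pin the member timestamp as well
                tar.addfile(info, io.BytesIO(files[name]))
    return out.getvalue()
```

Calling make_tar_gz twice with the same input should give byte-identical
output on the same Python version, since neither the gzip header nor the
tar members depend on the wall clock.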


> 2. Because gzip includes a platform indicator in the archive, the
> produced tar.gz archive is not reproducible across platforms.  (I don't
> know if gzip has an option to avoid that.  git archive uses an internal
> gzip implementation that handles this.)


This appears to be https://github.com/python/cpython/issues/112346
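
Both the timestamp from point 1 and the platform indicator from point 2
live in the fixed 10-byte gzip header defined by RFC 1952, so they are
easy to inspect. A small sketch (the helper name gzip_header_fields is
hypothetical) that can be pointed at archives from different tools or
platforms to see which field differs:

```python
import struct


def gzip_header_fields(data):
    """Return the fixed header fields of a gzip stream (RFC 1952 layout)."""
    # ID1+ID2 (2 bytes), CM, FLG, MTIME (4 bytes, little-endian), XFL, OS
    magic, method, flags, mtime, xfl, os_byte = struct.unpack(
        "<HBBIBB", data[:10]
    )
    if magic != 0x8B1F:  # bytes 0x1f 0x8b read as a little-endian short
        raise ValueError("not a gzip stream")
    return {
        "method": method,
        "flags": flags,
        "mtime": mtime,
        "xfl": xfl,
        "os": os_byte,
    }
```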


> 3. Meson does not support tar.bz2 archives.


Simple enough to add, but I'm a bit surprised as usually people seem to
want either gzip for portability or xz for efficient compression.


> 4. Meson uses git archive internally, but then unpacks and repacks the
> archive, which loses the ability to use git get-tar-commit-id.


What do you use this for? IMO a more robust way to track the commit used
is to use gitattributes export-subst to write a `.git_archival.txt` file
containing the commit sha1 and other info -- this can be read even after
the file is extracted, which means it can also be used to bake the ID
into the built binaries e.g. as part of --version output.
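
A rough sketch of the reading side, assuming a .gitattributes entry
".git_archival.txt export-subst" and a .git_archival.txt containing lines
such as "node: $Format:%H$" and "node-date: $Format:%cI$". The key names
and the helper parse_git_archival are assumptions for illustration, not
something this thread prescribes.

```python
def parse_git_archival(text):
    """Parse 'key: value' lines, skipping values git never substituted."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so ISO dates keep their colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        value = value.strip()
        if "$Format:" in value:
            # Placeholder still present: the file did not come from
            # "git archive", e.g. it was read from a plain checkout.
            continue
        fields[key.strip()] = value
    return fields
```

A build script could call this on the extracted file and fall back to
asking git directly when the result is empty.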


> 5. I have found that the tar archives created by meson and git archive
> include the files in different orders.  I suspect that the Python
> tarfile module introduces some either randomness or platform dependency.


A different order is meaningless in itself; the question is whether the
order is internally consistent. Python uses sorted() to guarantee a
stable order, which may be a different algorithm than the one
git-archive uses to guarantee a stable order. But the order should be
stable, and that is what matters.
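
To make the "stable, not identical" point concrete, here is a small
sketch (not Meson's actual code; build_sorted_tar is a hypothetical
helper) in which the member order depends only on the names, never on
the order in which the files were discovered:

```python
import io
import tarfile


def build_sorted_tar(files):
    """Write members in sorted() order so discovery order cannot leak in."""
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w") as tar:
        for name in sorted(files):
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            info.mtime = 0  # fixed so only the ordering is being compared
            tar.addfile(info, io.BytesIO(files[name]))
    return out.getvalue()
```

Two runs that see the same files in a different order still produce the
same archive; whether that order matches the one git archive picks is a
separate question.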


> 6. meson dist is also slower because of the additional work.


I'm amenable to skipping the extraction/recombination of subprojects and
running of dist scripts in the event that neither exists, as Tristan
offered to do, but...


> 7. meson dist produces .sha256sum files but we have called them .sha256.
>  (This is obviously trivial, but it is something that would need to be
> dealt with somehow nonetheless.)
> 
> Most or all of these issues are fixable, either upstream in Meson or by
> adjusting our own requirements.  But for now this route would have some
> significant disadvantages.


Overall I feel like much of this is about requiring dist tarballs to be
byte-identical to other dist tarballs, although reproducible builds is
mainly about artifacts, not sources, and for sources it doesn't
generally matter unless the sources are ephemeral and generated
on-demand (in which case it is indeed very important to produce the same
tarball each time). A tarball is usually generated once, signed, and
uploaded to release hosting. Meson already guarantees the contents are
strictly based on the built tag.


-- 
Eli Schwartz




Re: make dist using git archive

2024-01-31 Thread Eli Schwartz
On 1/31/24 3:03 AM, Peter Eisentraut wrote:
>> What do you use this for? IMO a more robust way to track the commit used
>> is to use gitattributes export-subst to write a `.git_archival.txt` file
>> containing the commit sha1 and other info -- this can be read even after
>> the file is extracted, which means it can also be used to bake the ID
>> into the built binaries e.g. as part of --version output.
> 
> It's a marginal use case, for sure.  But it is something that git
> provides tooling for that is universally available.  Any alternative
> would be an ad-hoc solution that is specific to our project and would be
> different for the next project.


Mercurial has the "archivemeta" config setting, which exports similar
information but forces the filename ".hg_archival.txt".

The setuptools-scm project follows this pattern by requiring the git
file to be called ".git_archival.txt" with a set pattern mimicking the
hg one:

https://setuptools-scm.readthedocs.io/en/latest/usage/#git-archives


So, I guess you could use this and then it would not be specific to your
project. :)


>> Overall I feel like much of this is about requiring dist tarballs to be
>> byte-identical to other dist tarballs, although reproducible builds is
>> mainly about artifacts, not sources, and for sources it doesn't
>> generally matter unless the sources are ephemeral and generated
>> on-demand (in which case it is indeed very important to produce the same
>> tarball each time).
> 
> The source tarball is, in a way, also an artifact.
> 
> I think it's useful that others can easily independently verify that the
> produced tarball matches what they have locally.  It's not an absolute
> requirement, but given that it is possible, it seems useful to take
> advantage of it.
> 
> In a way, this also avoids the need for signing the tarball, which we
> don't do.  So maybe that contributes to a different perspective.


Since you mention signing, and not merely as an aside...

That's a fascinating perspective. I wonder how people independently
verify that what they have locally (I assume from git clones) matches
what the postgres committers have authorized.

I'm a bit skeptical that you can avoid the need to perform code-signing
at some stage, somewhere, somehow, by suggesting that people can simply
git clone, run some commands and compare the tarball. The point of
signing is to verify that no one has acquired an untraceable API token
they should not have, gotten write access to the authoritative server,
and then uploaded malicious code under various forged identities,
possibly overwriting previous versions, either in git or out of git.

Ideally git commits should be signed, but that requires large numbers of
people to have security-minded git commit habits. From a quick check of
the postgres commit logs, only one person seems to be regularly signing
commits, which does provide a certain measure of protection -- an
attacker cannot attack via `git push --force` across that boundary, and
those commits serve as verifiable states that multiple people have seen.

The tags aren't signed either, which is a big issue for verifiably
identifying the release artifacts published by the release manager. Even
if not every commit is signed, having signed tags provides a known
coordination point of code that has been broadly tested and code-signed
for mass use.

...

In summary, my opinion is that using git-get-tar-commit-id provides zero
security guarantees, and if that's not something you are worried about
then that's one thing, but if you were expecting it to *replace* signing
the tarball, then that's very much another thing entirely, and not
one I can agree with at all.
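
To illustrate why the embedded ID carries no authentication: git archive
records the commit ID as a comment in a pax global extended header, and
that is ordinary metadata any tar writer can produce. The sketch below
uses Python's tarfile rather than git itself, and both helper names are
hypothetical.

```python
import io
import tarfile


def tar_with_comment(comment):
    """Write a tar whose pax global header carries an arbitrary comment."""
    out = io.BytesIO()
    with tarfile.open(
        fileobj=out,
        mode="w",
        format=tarfile.PAX_FORMAT,
        pax_headers={"comment": comment},
    ) as tar:
        payload = b"hello\n"
        info = tarfile.TarInfo("README")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return out.getvalue()


def read_comment(data):
    """Return the pax global header comment, or None if there is none."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as tar:
        tar.getmembers()  # make sure every header has been read
        return tar.pax_headers.get("comment")
```

Whatever 40 hex digits are passed in come straight back out, so the
value identifies a commit only to the extent that the tarball itself is
already trusted.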



-- 
Eli Schwartz

