On 1/8/26 10:04 AM, Simon Josefsson wrote:
Right, we need to keep that in mind, even if git-archive happens to
behave like this right now.
It doesn't behave like that right now.
If you run 'git archive' on a set of common distributions released in
the past 5 years, you will get several different variants:
1) RHEL8/9, Ubuntu 24.04+, Debian 12+, Guix: modern variant.
2) RHEL 10 eco-system: zlib-ng, different compression. I'm trying to
ignore this, but it is becoming harder as RHEL10 spreads.
3) Ubuntu 22.04 eco-system: export-subst has a long git describe
substitution.
4) Debian 11 eco-system: no export-subst support.
Comparing GitHub, GitLab, Codeberg etc generated archives (which may or
may not use 'git archive' internally) over the last 5 years also gives
different outputs.
I don't think we can view 'git archive' as a stable output format. It
is a temporary snapshot mechanism, and the format is a continuously
moving target, and documented as such.
It boils down to two problems:
- compression
- some niche git feature Debian may not have to support
In both Arch Linux and whatsrc the following is done to convert a git
tree-ish into a tar archive:
- Compression is disabled
- "* -export-subst -export-ignore\n" is written to `./.git/info/attributes`
- Use `-c core.abbrev=no`
https://gitlab.archlinux.org/pacman/pacman/-/commit/0828a085c146601f21d5e4afb5f396f00de2963b
https://github.com/kpcyrd/what-the-src/blob/fa1cbfd350164373221ae4950170cc288a0a3934/src/ingest/git.rs#L85-L157
This approach has worked well for Arch Linux since 2024. :)
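
A minimal sketch of the three steps above, in Python rather than the
shell/Rust used by the linked implementations. The function names are
made up for illustration and this is not the pacman or whatsrc code.
It assumes a git new enough to accept `core.abbrev=no`, and note that
it overwrites any existing repo-local `info/attributes` file.

```python
import hashlib
import subprocess
from pathlib import Path


def export_tree_tar(repo: str, treeish: str = "HEAD") -> bytes:
    """Export `treeish` from `repo` as an uncompressed tar (hypothetical helper)."""
    # Ask git where info/attributes lives instead of hard-coding ./.git,
    # so this also works for bare repositories.
    rel = subprocess.run(
        ["git", "rev-parse", "--git-path", "info/attributes"],
        cwd=repo, check=True, capture_output=True, text=True,
    ).stdout.strip()
    attributes = Path(repo, rel)
    attributes.parent.mkdir(parents=True, exist_ok=True)
    # Caution: replaces whatever repo-local attributes were there before.
    attributes.write_text("* -export-subst -export-ignore\n")

    # --format=tar means no compression is applied to the output.
    return subprocess.run(
        ["git", "-c", "core.abbrev=no", "archive", "--format=tar", treeish],
        cwd=repo, check=True, capture_output=True,
    ).stdout


def sha256_hex(data: bytes) -> str:
    """SHA-256 over the uncompressed tar bytes."""
    return hashlib.sha256(data).hexdigest()
```

Usage would be something like
`sha256_hex(export_tree_tar("/path/to/repo", "v1.0"))`, where the path
and tag are placeholders.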
In whatsrc the sha256-over-the-uncompressed-tar is considered
"canonical", and can have 0..n aliases. For example `sha256(gz(tar))` or
`blake2b(xz(tar))` would both resolve to `sha256(tar)` internally.
Essentially "oh, this is the hash of a compressed tar file that I've
seen before, and I know it decompresses to a tar file with hash ...".
This is also a claim that's trivial to verify (given access to the
compressed file).
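
To illustrate the alias idea, here is a rough Python sketch. It is not
how whatsrc is actually implemented; the `AliasIndex` class and its
method names are invented for this example. It only shows that an alias
is recorded after decompressing the file and hashing the result, which
is also how anyone holding the compressed file can verify the claim.

```python
import gzip
import hashlib
import lzma


class AliasIndex:
    """Maps hashes of compressed files to the canonical sha256(tar)."""

    def __init__(self):
        # (hash algorithm, hex digest of compressed file) -> sha256(tar)
        self._canonical = {}

    def record(self, compressed: bytes, decompress, algo: str = "sha256") -> str:
        # Decompress first, so the alias is backed by the actual content.
        tar = decompress(compressed)
        canonical = hashlib.sha256(tar).hexdigest()
        alias = (algo, hashlib.new(algo, compressed).hexdigest())
        self._canonical[alias] = canonical
        return canonical

    def resolve(self, algo: str, digest: str):
        # Returns None for a compressed file that has not been seen before.
        return self._canonical.get((algo, digest))


tar = b"stand-in for real tar bytes"
gz = gzip.compress(tar, mtime=0)
xz = lzma.compress(tar)

index = AliasIndex()
index.record(gz, gzip.decompress, "sha256")   # sha256(gz(tar))
index.record(xz, lzma.decompress, "blake2b")  # blake2b(xz(tar))

# Both aliases resolve to the same canonical sha256(tar).
gz_alias = hashlib.sha256(gz).hexdigest()
xz_alias = hashlib.blake2b(xz).hexdigest()
assert index.resolve("sha256", gz_alias) == index.resolve("blake2b", xz_alias)
```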
For pacman, using a sha256 hash of the git commit object instead was
also discussed, but this was decided against because it would suggest
sha256-like security while git itself virtually always depends on sha1
for security behind the scenes[1]. It was decided that a
cryptographically secure hash over the git tree-ish export content would
be more secure.
https://gitlab.archlinux.org/pacman/pacman/-/merge_requests/9#note_93037
[1]: I looked up when SHAttered was published (the point in time that
computer people essentially agreed that SHA-1 is insecure) and it was
February 2017 apparently, almost 9 years ago. Time flies.
It was reproducible unless the git repository sees further commits.
Pruning later commits somehow from the git bundle should be possible,
and then things would be reproducible again. But I don't know how.
There is some advice from git people on how to do it:
https://lore.kernel.org/git/[email protected]/t/#md469596b6b95790efe045e408b1d2f19503048cd
However it looked so hacky I really didn't want to go down that road,
hoping someone else would come up with a better way to do this.
I was about to suggest `-c core.abbrev=no`, then realized git bundle is
based on git pack files (which I'd rather consider git-internal). If it's
just for "convert a git tree-ish to some binary blob we can reason about"
then `git archive` might still be our best option.
cheers,
kpcyrd