Hello,

Paul: I hope you don't mind being involved in this. I would appreciate
any input that you are able to provide. Context is Debian bug #897653
(https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=897653), but all the
important bits are already quoted below.

The general idea of pristine-tar is the following: as a distribution
maintainer using git, I want to be able to recreate the tarball released
by the upstream maintainers without duplicating the bits that I already
have in git (i.e. the actual source code). So when I import a new
release tarball, pristine-tar extracts it into my git repository, then
stores a (hopefully) small delta file that can be used later to
reconstruct the tarball I just imported from the contents of the git
repository.
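
To make the failure mode concrete, here is a toy Python sketch of the
"small delta against a recreated tarball" idea. It is not pristine-tar's
code and not the xdelta3 format: make_delta, apply_delta and the byte
strings are made up for illustration, and the sketch assumes that base
and target have the same length. It only shows that a delta is valid
solely against the exact bytes it was computed from.

```python
import hashlib


def make_delta(base: bytes, target: bytes) -> dict:
    """Toy delta: record only the bytes where target differs from base."""
    if len(base) != len(target):
        raise ValueError("toy delta only supports equal-length inputs")
    patches = {i: target[i] for i in range(len(base)) if base[i] != target[i]}
    return {"patches": patches,
            "target_sha256": hashlib.sha256(target).hexdigest()}


def apply_delta(base: bytes, delta: dict) -> bytes:
    """Apply the toy delta, then verify the result against the stored hash."""
    out = bytearray(base)
    for offset, value in delta["patches"].items():
        out[offset] = value
    if hashlib.sha256(out).hexdigest() != delta["target_sha256"]:
        raise ValueError("checksum mismatch: base is not the one the delta was made from")
    return bytes(out)


# Hypothetical stand-ins for the upstream tarball and for the tarball that
# is recreated from the git tree by an old and by a new tar.
upstream = b"hdr:root:root|upstream-bytes"
recreated_old = b"hdr:root:root|git-tree-bytes"
recreated_new = b"hdr:\0\0\0\0:\0\0\0\0|git-tree-bytes"

delta = make_delta(recreated_old, upstream)

# Same base as at import time: the upstream bytes come back exactly.
assert apply_delta(recreated_old, delta) == upstream

# A base that differs in bytes the delta never touched: verification fails.
try:
    apply_delta(recreated_new, delta)
except ValueError as err:
    print("failed as expected:", err)
```

The delta stays small because it only covers what differs from the
recreated archive, and that is exactly why any change in how the archive
is recreated invalidates it.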

This is no longer working for some (but not all) tarballs, because the
archives created by tar 1.30 and tar 1.29 are different for the same
data (see below for one example). This is not a problem for newly
imported tarballs, but it is a problem for tarballs that were imported
with tar < 1.30, since their deltas were computed against the archive
that the older tar produced.

On Thu, May 03, 2018 at 05:14:54PM -0300, Antonio Terceiro wrote:
> Package: tar
> Version: 1.30+dfsg-1
> Severity: grave
> Justification: causes non-serious data loss
> 
> Hi,
> 
> After tar 1.30 arrived in unstable, pristine-tar can no longer
> reproduce tarballs that it previously could. For example I just hit this
> with bundler:
> 
> $ debcheckout -a bundler
> declared git repository at g...@salsa.debian.org:ruby-team/bundler.git
> git clone https://salsa.debian.org/ruby-team/bundler.git bundler ...
> Cloning into 'bundler'...
> remote: Counting objects: 3981, done.
> remote: Compressing objects: 100% (1289/1289), done.
> remote: Total 3981 (delta 2570), reused 3950 (delta 2551)
> Receiving objects: 100% (3981/3981), 3.55 MiB | 24.00 KiB/s, done.
> Resolving deltas: 100% (2570/2570), done.
> git remote set-url --push origin g...@salsa.debian.org:ruby-team/bundler.git 
> ...
> $ cd bundler/
> $ pristine-tar checkout /tmp/bundler_1.16.1.orig.tar.gz
> xdelta3: target window checksum mismatch: XD3_INVALID_INPUT
> xdelta3: normally this indicates that the source file is incorrect
> xdelta3: please verify the source file with sha1sum or equivalent
> xdelta3: target window checksum mismatch: XD3_INVALID_INPUT
> xdelta3: normally this indicates that the source file is incorrect
> xdelta3: please verify the source file with sha1sum or equivalent
> xdelta3: target window checksum mismatch: XD3_INVALID_INPUT
> xdelta3: normally this indicates that the source file is incorrect
> xdelta3: please verify the source file with sha1sum or equivalent
> xdelta3: target window checksum mismatch: XD3_INVALID_INPUT
> xdelta3: normally this indicates that the source file is incorrect
> xdelta3: please verify the source file with sha1sum or equivalent
> pristine-tar: Failed to reproduce original tarball. Please file a bug report.
> pristine-tar: failed to generate tarball
> 
> See also #897249 and #897421 for similar pristine-tar user reports.
> Downgrading tar to 1.29b-2 makes it work again.

The messages above are from xdelta3, and they mean that we are trying to
apply a delta to a file that differs from the one the delta was
originally computed against. Indeed, by dumping the tarballs recreated
with tar 1.29 and tar 1.30, I was able to find the difference:

$ diffoscope ./old/0/recreatetarball ./new/0/recreatetarball
 
--- ./old/0/recreatetarball
+++ ./new/0/recreatetarball
│┄ No file format specific differences found inside, yet data differs (POSIX tar archive (GNU))
@@ -58403,24 +58403,24 @@
 000e4220: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 000e4230: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 000e4240: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 000e4250: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 000e4260: 0000 0000 3030 3030 3030 3000 3030 3030  ....0000000.0000
 000e4270: 3030 3000 3030 3030 3030 3000 3030 3030  000.0000000.0000
 000e4280: 3030 3030 3134 3600 3030 3030 3030 3030  0000146.00000000
-000e4290: 3030 3000 3031 3135 3636 0020 4c00 0000  000.011566. L...
+000e4290: 3030 3000 3030 3737 3536 0020 4c00 0000  000.007756. L...
 000e42a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 000e42b0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 000e42c0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 000e42d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 000e42e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 000e42f0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
-000e4300: 0075 7374 6172 2020 0072 6f6f 7400 0000  .ustar  .root...
+000e4300: 0075 7374 6172 2020 0000 0000 0000 0000  .ustar  ........
 000e4310: 0000 0000 0000 0000 0000 0000 0000 0000  ................
-000e4320: 0000 0000 0000 0000 0072 6f6f 7400 0000  .........root...
+000e4320: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 000e4330: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 000e4340: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 000e4350: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 000e4360: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 000e4370: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 000e4380: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 000e4390: 0000 0000 0000 0000 0000 0000 0000 0000  ................
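
The three changed lines fit the tar header layout: the checksum field
sits at offset 148 of the 512-byte header, the user name at offset 265
and the group name at offset 297. The following Python sketch rebuilds a
header with the field values visible in the dump and recomputes the
checksum with and without the "root" names. The entry name
"././@LongLink" is an assumption, since that part of the header is not
shown in the diff, and build_header is a made-up helper, not something
taken from tar.

```python
def build_header(uname: bytes, gname: bytes) -> bytes:
    """Build a 512-byte GNU tar header using the field values seen in the dump."""
    block = bytearray(512)

    def put(offset: int, data: bytes) -> None:
        block[offset:offset + len(data)] = data

    put(0, b"././@LongLink")    # name (assumed, not visible in the diff)
    put(100, b"0000000\0")      # mode
    put(108, b"0000000\0")      # uid
    put(116, b"0000000\0")      # gid
    put(124, b"00000000146\0")  # size
    put(136, b"00000000000\0")  # mtime
    put(148, b" " * 8)          # checksum field counts as spaces while summing
    put(156, b"L")              # typeflag 'L', as in the dump
    put(257, b"ustar  \0")      # GNU magic
    put(265, uname)             # user name
    put(297, gname)             # group name

    checksum = sum(block)       # plain sum of all 512 header bytes
    put(148, b"%06o\0 " % checksum)
    return bytes(block)


old_header = build_header(b"root", b"root")  # names filled in (the "-" lines)
new_header = build_header(b"", b"")          # names left empty (the "+" lines)

print(old_header[148:154].decode())  # 011566
print(new_header[148:154].decode())  # 007756
```

Both values match the dump, and the difference between them is 904,
which is twice the sum of the byte values of "root". So the checksum
change is only a consequence of the user and group names being left
empty, not a separate difference.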

I cloned the tar git repository and ran a bisection. The commit that broke
things is the following:

commit da8d0659a6fe8faf76b3a3275cf1f403e78edb1f
Author: Paul Eggert <egg...@cs.ucla.edu>
Date:   Thu Apr 6 18:16:51 2017 -0700

    --numeric-owner now affects private headers too
    
    Problem reported by Daniel Peebles in:
    http://lists.gnu.org/archive/html/bug-tar/2017-04/msg00004.html
    * NEWS: Document this.
    * src/create.c (write_gnu_long_link): If --numeric-owner,
    leave the user and group empty in a private header.  Cache the
    names for 0.

Reverting the changes in this commit "fixes" it, but of course, given
this is a patch that is supposed to *improve* reproducibility, just
reverting it is probably not what we want. I still need to study the
code a bit further to try to come up with a better suggestion.
