Hey Ludo,

Ludovic Courtès <ludovic.cour...@inria.fr> writes:

> I copied over the 12K entries that were missing from
> disarchive.guix.gnu.org.  (Note that there are currently only two copies
> of the database: one at/in [bB]erlin, and one at/in [Bb]ordeaux.)
> disarchive.guix.gnu.org now weighs in at 1.8 GiB for 31,839 entries.

Wow – 12K!  For some reason I thought it would be fewer.  It’s very good that we (finally) sync’d up the databases.

Also, my set is now at 31,821 after collecting the runoff from the latest Preservation of Guix Report.  That’s shockingly close to the 31,839 you have.

> For the remaining entries, it’s trickier.  Sometimes it’s just the
> gzip compression parameters that differ, which could be addressed with a
> little bit more work:
>
>   $ file ffdc77f5e5cb2390b9309de63eb7be68d9fe631e898f4da6c04a8159daefc2c0.gz ../../disarchive/sha256/ffdc77f5e5cb2390b9309de63eb7be68d9fe631e898f4da6c04a8159daefc2c0.gz
>   ffdc77f5e5cb2390b9309de63eb7be68d9fe631e898f4da6c04a8159daefc2c0.gz: gzip compressed data, max compression, from Unix, original size modulo 2^32 446731
>   ../../disarchive/sha256/ffdc77f5e5cb2390b9309de63eb7be68d9fe631e898f4da6c04a8159daefc2c0.gz: gzip compressed data, max speed, from Unix, original size modulo 2^32 446731

I’m not sure getting the compressed files to match matters.  Disarchive cares a lot about that when it comes to source code tarballs, because everybody signs and computes checksums over the compressed versions.  However, for these files, the differences introduced by compression can be ignored.

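To make "can be ignored" concrete, here is a minimal Python sketch that compares two gzip-compressed specification files by their decompressed contents only.  The helper names (`payload_sha256`, `same_payload`) are made up for illustration and are not part of Disarchive or Guix; the sketch assumes each file is an ordinary gzip stream.

```python
import gzip
import hashlib


def payload_sha256(path):
    """Return the SHA-256 hex digest of the decompressed contents of a gzip file."""
    digest = hashlib.sha256()
    with gzip.open(path, "rb") as stream:
        # Read in chunks so large files are never held in memory all at once.
        for chunk in iter(lambda: stream.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_payload(path_a, path_b):
    """True if both files decompress to identical bytes.

    Differences in the gzip header or compression level are ignored.
    """
    return payload_sha256(path_a) == payload_sha256(path_b)
```

Two copies of the same specification, one written with "max compression" and one with "max speed", would compare as equal under `same_payload` even though the `.gz` files differ byte for byte.
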
> Sometimes it’s trickier:
>
>   # diff -u <(gunzip -d < 0001f025c1425ffe36270a81cb091eade87dd8d29ac773735ae47e1a8c8066c9.gz) <(gunzip -d < ../../disarchive/sha256/0001f025c1425ffe36270a81cb091eade87dd8d29ac773735ae47e1a8c8066c9.gz)
>   --- /dev/fd/63 2023-03-14 16:13:21.635733426 +0100
>   +++ /dev/fd/62 2023-03-14 16:13:21.635733426 +0100
>   @@ -1,7 +1,7 @@
>    (disarchive
>      (version 0)
>      (gzip-member
>   -    (name "webview-sys-0.6.2.tar.gz")
>   +    (name "rust-webview-sys-0.6.2.tar.gz")
>        (digest
>          (sha256
>            "0001f025c1425ffe36270a81cb091eade87dd8d29ac773735ae47e1a8c8066c9"))
>   @@ -13,7 +13,7 @@
>        (footer (crc 1807070134) (isize 121344))
>        (compressor zlib-best)
>        (input (tarball
>   -      (name "webview-sys-0.6.2.tar")
>   +      (name "rust-webview-sys-0.6.2.tar")
>          (digest
>            (sha256
>              "4fb18f3206838e11f7f8caba6fad9e0f796109428b502793b9f2f0613fe0f275"))
>   @@ -78,7 +78,7 @@
>          (padding 0)
>          (input (directory-ref
>            (version 0)
>   -        (name "webview-sys-0.6.2")
>   +        (name "rust-webview-sys-0.6.2")
>            (addresses
>              (swhid "swh:1:dir:fa41df38bf639ada28c900b0915661e787fe6d15"))
>            (digest

The name field is not used for data reconstruction.  It’s for human consumption (and it may have made some early examples of use at the command line easier to explain).  Here, the difference is based on the fact that Crate URIs are weird, and the Preservation of Guix code does not keep the origin file name.  Hence, the PoG version extracts the Crate name alone from the URI, and the Cuirass version uses the Guix package name with the “rust-” prefix.

> As Tim pointed out, Disarchive disassembly is not fully deterministic
> and/or might change a bit over time as Disarchive evolves, and that’s
> prolly what we’re seeing here.

I honestly think this is a good thing.  My instincts tell me that we should excise all sources of ambiguity, like we’re trying to do in the big picture.  However, Disarchive will get better at describing things over time.
For instance, it doesn’t handle tar extension headers elegantly at the moment.  In the future, if I fix this, I might consider creating a “migrate” feature that improves existing specifications (e.g., converting the old, verbose representation of extension headers into the new representation).  In particular, I’ve left some warts in the software in order to ship it, and I would be sad to try and commit to those for the rest of time!  We might also add other resolver addresses besides SWHIDs....

Maybe I’m missing some perspective, but I don’t think trying to commit to reproducible outputs for Disarchive makes sense.

-- Tim

P.S., we’ll have to do this dance again shortly, as I just computed 2,023 historical bzip2 specifications.  They’re not online yet, but they’ll be up when I publish the next PoG report – which should take less than a year this time!  :p