Hi Timothy,

On Wed, 20 Oct 2021 at 15:48, Timothy Sample <samp...@ngyro.com> wrote:

> Early this summer I did a bunch of work trying to figure out which Guix
> sources are preserved by the SWH archive.  I’m finally ready to share
> some preliminary results!
>
> https://ngyro.com/pog-reports/2021-10-20/

Cool!  Really interesting.

> What’s cool is that the report is automated.  Next on my list is to
> update the database and generate a new report.  Then, we can compare the
> results and see if we are improving.  (My read on the results so far is
> that improving “sources.json” will yield big improvements, but we might
> not be able to get to that before the next report.)

Here are two minor comments:

 1. For a couple of days now, I have been running:

      $ GUIX_SWH_TOKEN=$TOKEN guix lint -c archival

    where $TOKEN is provided by the SWH Authentication service [1].
    With the token, the rate limit is 1200 instead of 120, so more
    ’git-fetch’ packages get added.  I am in the process of automating
    that, but do not hold your breath. :-)

 2. For still unknown reasons, the bridge between SWH and Disarchive
    has some holes.  For instance,

      $ guix lint -c archival znc
      gnu/packages/messaging.scm:996:12: znc@1.8.2: Disarchive entry refers to non-existent SWH directory '33a3b509b5ff8e9039626d11b7a800281884cf2a'

      $ wget https://guix.gnu.org/sources.json
      $ cat sources.json | jq | grep znc
      "integrity": "sha256-IwbxlQzncsWlmlf1SG1Zu5yrmEl8RfxJy8RawN7BGbs="
      "integrity": "sha256-q0jatpd+j0PW//szIo0ViGX2jd5wJtEjxpPXcznc8rs="
      "https://znc.in/releases/archive/znc-1.8.2.tar.gz"

      $ guix download https://znc.in/releases/archive/znc-1.8.2.tar.gz
      Starting download of /tmp/guix-file.hnjWTE
      From https://znc.in/releases/archive/znc-1.8.2.tar.gz...
       znc-1.8.2.tar.gz  2.0MiB  599KiB/s 00:03 [##################] 100.0%
      /gnu/store/58khbiwp2ghhzg00gnzdy2jlfv49vajm-znc-1.8.2.tar.gz
      03fyi0j44zcanj1rsdx93hkdskwfvhbywjiwd17f9q1a7yp8l8zz

    Therefore, something is wrong somewhere.  Because of #1, I detect
    many such examples.  I do not know whether the SWH-ID computed by
    Disarchive is incorrect or whether SWH has not ingested the content.
    Investigation is required.
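
A side note on the lookup in item 2: the grep matches any line containing
"znc", and the two "integrity" lines in that output match only because
their base64 strings happen to contain those letters, so the output alone
does not tell which entry they belong to.  A minimal sketch of a stricter
lookup that only inspects URLs follows.  It assumes that sources.json has a
top-level "sources" list whose entries carry "urls" and "integrity" fields
(an assumption about the file layout, not verified here); the function
name and the sample data are made up for illustration.

```python
import json


def find_sources(data, needle):
    """Return the entries that have at least one URL containing `needle`.

    `data` is the parsed JSON document.  The "sources" and "urls" field
    names are an assumption about the layout of sources.json.
    """
    return [
        entry
        for entry in data.get("sources", [])
        if any(needle in url for url in entry.get("urls", []))
    ]


# Made-up sample; the integrity values are placeholders, not real hashes.
sample = json.loads("""
{"sources": [
  {"type": "url",
   "urls": ["https://znc.in/releases/archive/znc-1.8.2.tar.gz"],
   "integrity": "sha256-placeholder-one"},
  {"type": "url",
   "urls": ["https://example.org/other-1.0.tar.gz"],
   "integrity": "sha256-placeholder-znc-two"}
]}
""")

# Only the first entry matches: the second one contains "znc" in its
# integrity string but not in any of its URLs.
hits = find_sources(sample, "znc")
for entry in hits:
    print(entry["integrity"], entry["urls"])
```

For the real file, the sample can be replaced by the result of
json.load() on the downloaded sources.json.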
:-)

1: <https://archive.softwareheritage.org/api/>

> It’s surprising to me that SWH is not already getting these from
> “sources.json”.  I picked an arbitrary one, “rust-quote-0.6”, and it’s
> simply not in “sources.json”.  On the other hand, I bet SWH would like a
> crates.io (and CRAN, etc.) loader, too.

From the SWH doc, there is a CRAN lister [2], but I have not checked
what it concretely ingests.  On our side, we use ’url-fetch’, so it
seems possible to me that there is a tiny mismatch between what is
inside the release tarball (what we concretely use) and what SWH ingests
directly from CRAN.

2: <https://docs.softwareheritage.org/devel/apidoc/swh.lister.cran.html?highlight=cran#module-swh.lister.cran>

To answer your question [3] about “sources.json”: I think the ingestion
started after commit 35bb77108fc7f2339da0b5be139043a5f3f21493 in
guix-artwork.  In other words, SWH started to ingest from “sources.json”
after July 2020, probably around September 2020.

3: <https://lists.gnu.org/archive/html/guix-devel/2021-10/msg00141.html>

> One other way to help would be to suggest improvements to the report.  I
> don’t want to fiddle with it too much, but if there is some simple graph
> or table or list that should be there, I’m happy to give it a go.

For the Missing and Unknown fields, could you distinguish the kind of
origin?  Is it mainly git-fetch, url-fetch, or something else?  That
would help to spot where the issues are and what to work on
(sources.json, SWH side, Disarchive, etc.).

Cheers,
simon