Hi Ludo,

Well, you are enlarging the discussion to more than the issue of the 5 url-fetch packages on gforge.inria.fr. :-)
First of all, you wrote [1]: ``Migration away from tarballs is already happening as more and more software is distributed straight from content-addressed VCS repositories, though progress has been relatively slow since we first discussed it in 2016.''  On the other hand, Guix more often than not uses [2] "url-fetch" even when "git-fetch" is available upstream.  In other words, I am not convinced the migration is really happening...  The issue would be mitigated if Guix transitioned from "url-fetch" to "git-fetch" whenever possible.

1: https://forge.softwareheritage.org/T2430#45800
2: https://lists.gnu.org/archive/html/guix-devel/2020-05/msg00224.html

Second, while trying to compute some statistics about the SWH coverage, I noticed that a non-negligible number of "url-fetch" sources are reachable via "lookup-content".  The coverage is not straightforward to measure because of the 120-requests-per-hour rate limit and occasional unexpected server errors, but that is another story.  I would like to have numbers, because I do not know what the issue concretely is: how many "url-fetch" packages are reachable?  And if some are unreachable, is it because they are not archived yet, or because Guix does not have enough information to look them up?

On Sat, 11 Jul 2020 at 17:50, Ludovic Courtès <l...@gnu.org> wrote:

> For the now, since 70% of our packages use ‘url-fetch’, we need to be
> able to fetch or to reconstruct tarballs. There’s no way around it.

Yes, but for example all the packages in gnu/packages/bioconductor.scm could use "git-fetch".  Today their source is fetched with url-fetch, but it could be fetched with git-fetch from https://git.bioconductor.org/packages/flowCore or g...@git.bioconductor.org:packages/flowCore (see the sketch below).

Another example is the packages in gnu/packages/emacs-xyz.scm: the ones from elpa.gnu.org use "url-fetch" and could use "git-fetch", for example via http://git.savannah.gnu.org/gitweb/?p=emacs/elpa.git;a=tree;f=packages/ace-window;h=71d3eb7bd2efceade91846a56b9937812f658bae;hb=HEAD

So I would be more reserved about the "no way around it". :-)  I mean, the 70% could be somewhat reduced.

> In the short term, we should arrange so that the build farm keeps GC
> roots on source tarballs for an indefinite amount of time. Cuirass
> jobset? Mcron job to preserve GC roots? Ideas?

Yes, preserving source tarballs for an indefinite amount of time will help.  At least for all the packages where "lookup-content" returns #f, which means they are either not in SWH or unreachable -- both are equivalent from Guix's point of view.  What about, in addition, pushing to IPFS?  Is that feasible?  Would lookup be an issue?

> For the future, we could store nar hashes of unpacked tarballs instead
> of hashes over tarballs. But that raises two questions:
>
>   • If we no longer deal with tarballs but upstreams keep signing
>     tarballs (not raw directory hashes), how can we authenticate our
>     code after the fact?

Does Guix automatically authenticate code using signed tarballs?

>   • SWH internally store Git-tree hashes, not nar hashes, so we still
>     wouldn’t be able to fetch our unpacked trees from SWH.
>
> (Both issues were previously discussed at
> <https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/>.)
>
> So for the medium term, and perhaps for the future, a possible option
> would be to preserve tarball metadata so we can reconstruct them:
>
>   tarball = metadata + tree

There are different issues at different levels:

 1. how to look up?  what information do we need to keep/store to be able to query SWH?
 2. how to check the integrity?  what information do we need to keep/store to be able to verify that SWH returns what Guix expects?
 3. how to authenticate?
 4. where does the tarball metadata have to be stored if SWH removes it?
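To make the Bioconductor point concrete, here is roughly what the flowCore source could look like with "git-fetch".  This is only a sketch: the commit and the hash below are placeholders, and "name"/"version" are the usual package fields.

(origin
  (method git-fetch)
  (uri (git-reference
         (url "https://git.bioconductor.org/packages/flowCore")
         ;; Placeholder: the commit (or tag) pinning the release.
         (commit "0000000000000000000000000000000000000000")))
  (file-name (git-file-name name version))
  (sha256
   ;; Placeholder hash.
   (base32 "0000000000000000000000000000000000000000000000000000")))

And regarding point 1 (how to look up), checking whether a given url-fetch source is already known to SWH can be done with "lookup-content" from the (guix swh) module, along these lines; again a rough sketch, and the procedure name is mine:

(use-modules (guix swh)
             (guix packages)
             (gnu packages))

;; Is the source of this url-fetch package already known to SWH?
;; 'lookup-content' returns #f when the content is not archived
;; (or when the rate limit kicks in, which is another story).
(define (swh-knows-source? name)
  (let ((source (package-source (specification->package name))))
    (lookup-content (origin-sha256 source) "sha256")))

;; For example: (swh-knows-source? "r-flowcore")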
Basically, the git-fetch source stores 3 identifiers:

 - the upstream url
 - a commit / tag
 - the integrity (sha256)

Fetching from SWH requires either the commit alone (lookup-revision) or the tag+url (lookup-origin-revision); then the integrity of the downloaded data is checked against the sha256, right?

Therefore, one way to fix the lookup of url-fetch sources is to add an extra field playing the role of the commit.  The easiest would be to store a SWHID, or an identifier from which the SWHID can be deduced.  I have not checked the code, but something like this:

https://pypi.org/project/swh.model/
https://forge.softwareheritage.org/source/swh-model/

and at package time, this identifier would be added, similarly to the integrity.  As an aside, does Guix use the authentication metadata that tarballs provide?

(BTW, I failed [3,4] to package swh.model, so if someone wants to give it a try...

3: https://lists.gnu.org/archive/html/help-guix/2020-06/msg00158.html
4: https://lists.gnu.org/archive/html/help-guix/2020-06/msg00161.html )

> After all, tarballs are byproducts and should be no exception: we should
> build them from source. :-)

[...]

> The code below can “disassemble” and “assemble” a tar. When it
> disassembles it, it generates metadata like this:

[...]

> The ‘assemble-archive’ procedure consumes that, looks up file contents
> by hash on SWH, and reconstructs the original tarball…

Where do you plan to store the "disassembled" metadata?  And where do you plan to run "assemble-archive"?  I mean, what is pushed to SWH?  And how?  What is fetched from SWH?  And how?  (Well, the answer is below. :-))

> … at least in theory, because in practice we hit the SWH rate limit
> after looking up a few files:

Yes, it is 120 requests per hour and 10 saves per hour.  I do not think they will increase these numbers much in general, but they seem open to exceptions for specific machines.  I do not want to speak for them, but we could ask for a higher rate limit for ci.guix.gnu.org, for example.  Then we would need to distinguish between source substitutes and binary substitutes.  Basically, when a user runs "guix build foo", if the source is not available upstream nor already on ci.guix.gnu.org, then ci.guix.gnu.org fetches the missing sources from SWH and delivers them to the user.

> https://archive.softwareheritage.org/api/#rate-limiting
>
> So it’s a bit ridiculous, but we may have to store a SWH “dir”
> identifier for the whole extracted tree—a Git-tree hash—since that would
> allow us to retrieve the whole thing in a single HTTP request.

Well, the limited resources of SWH are an issue, but SWH is an archive, not a mirror. :-)  And as I wrote above, we could ask SWH to increase the rate limit for specific machines such as ci.guix.gnu.org.

> I think we’d have to maintain a database that maps tarball hashes to
> metadata (!). A simple version of it could be a Git repo where, say,
> ‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would
> contain the metadata above. The nice thing is that the Git repo itself
> could be archived by SWH. :-)

How should this database that maps tarball hashes to metadata be maintained?  Git push hook?  Cron task?  What about foreign channels?  Should they maintain their own map?
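Whatever the answer, the client side of such a database repo could be trivial.  A rough sketch, where the local checkout path is hypothetical and the layout (one file per nix-base32 sha256) simply follows what you describe:

;; Hypothetical local clone of the metadata database repo, laid out as
;; sha256/<nix-base32> files containing the "disassembled" metadata
;; (an s-expression).
(define %metadata-checkout
  "/var/cache/guix/tarball-metadata")

(define (tarball-metadata sha256-base32)
  "Return the metadata recorded for the tarball whose nix-base32 SHA256
is SHA256-BASE32, or #f if the database does not know about it."
  (let ((file (string-append %metadata-checkout "/sha256/"
                             sha256-base32)))
    (and (file-exists? file)
         (call-with-input-file file read))))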
To summarize, it would work like this, right?

At package time:

 - store an integrity identifier (today the sha256, nix-base32-encoded)
 - disassemble the tarball
 - commit the metadata to another repo, at the path (address) sha256/base32/<identifier>
 - push to packages-repo *and* metadata-database-repo

At a future time (upstream has disappeared, say!):

 - use the integrity identifier to query the database repo
 - look up the SWHID in the database repo (rough sketch in the PS below)
 - fetch the data from SWH
 - or look up, say, an IPFS identifier in the database repo and fetch the data from IPFS instead
 - re-assemble the tarball using the metadata from the database repo
 - check integrity, authentication, etc.

Well, right, this is better than only adding an identifier for lookup, as I described above, because it is more general and flexible than having only SWH as a fall-back.  The format of the "disassemble" metadata that you propose is schemish (obviously! :-)), but we could also propose something more JSON-like.

All the best,
simon
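PS: And just to check that I understand the "at future time" side, it boils down to something like the sketch below, reusing the hypothetical 'tarball-metadata' procedure sketched earlier; the 'swhid entry is equally hypothetical and stands for whatever the metadata format ends up recording.

(define (fallback-swhid sha256-base32)
  ;; From the integrity identifier to the SWHID recorded in the
  ;; metadata database; fetching the tree from SWH (or IPFS) and
  ;; re-running 'assemble-archive' would come next.
  (let ((metadata (tarball-metadata sha256-base32)))
    ;; Hypothetical: assume the metadata is an alist with an 'swhid entry.
    (and metadata (assq-ref metadata 'swhid))))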