Ludovic Courtès <ludovic.cour...@inria.fr> writes:

> I gave a 10–15mn talk on how Guix uses SWH, what Disarchive is, what
> the current status of the “preservation of Guix” is, and what remains
> to be done:
>
>   https://git.savannah.gnu.org/cgit/guix/maintenance.git/plain/talks/swh-unesco-2021/talk.20211130.pdf

Wow – great work!

> I chatted with the SWH tech team; they’re obviously very busy solving
> all sorts of scalability challenges :-) but they’re also truly
> interested in what we’re doing and in supporting our use case.  Off the
> top of my head, here are some of the topics discussed:
>
>   • ingesting past revisions: if we can give them ‘sources.json’ for
>     past revisions, they’re happy to ingest them;

This is something I can probably coax out of the Preservation of Guix
database.  That might be the cheapest way to do it.  Alternatively, when
we get “sources.json” built with Cuirass, we could tell Cuirass to build
out a sample of previous commits to get pretty good coverage.  (Side
note: eventually we could verify the coverage of the sampling approach
using the Data Service, which has processed a very exhaustive list of
commits.)  There’s a rough sketch of what I have in mind at the end of
this message.

>   • rate limit: we can find an arrangement to raise it for the purposes
>     of statistics gathering like Simon and Timothy have been doing (we
>     can discuss the details off-list);

Cool!  So far it hasn’t been a concern for me, but it would help in the
future if we want to try and track down Git repositories that have gone
missing.  (A sketch of the kind of per-origin check I mean is also at
the end of this message.)

>   • Disarchive: they’d like to better understand the “unknowns” in the
>     PoG plots (I wasn’t sure if it was non-tar.gz tarballs or what) and
>     to work on the definitely-missing origins that show up there;

Many of the unknowns are there for me to track Disarchive progress.
It’s not really the clearest reporting, but it tracks more what Guix can
handle automatically than what we could theoretically know about.
Basically, something is “known” if it can be downloaded from upstream
and either it’s a non-recursive Git reference or it’s something
Disarchive can handle (the heuristic is spelled out in a sketch at the
end of this message).  Hence, we know nothing about other version
control systems and, say, “.tar.bz2” archives.  Also, all these things
are based on heuristics.  :)  As we get closer to 100% known, we can
start analyzing everything more closely.

> they’re not opposed to the idea of eventually hosting or maintaining
> the Disarchive database (in fact one of the developers thought we
> were hosting it in Git and that as such they were already archiving
> it—maybe we could go back to Git?);

It’s a possibility, but right now I’m hopeful that the database will be
in the care of SWH directly before too long.  I’d rather wait and see at
this point.  I’m sure we could manage it, but the uncompressed size of
the Disarchive specification of a Chromium tarball is 366M, and storing
all the XZ specifications uncompressed takes over 20G.  It would be a
big Git repo!

>   • bit-for-bit archival: there’s a tension between making SWH a
>     “canonical” representation of VCS repos and making it a faithful,
>     bit-for-bit identical copy of the original, and there are different
>     opinions in the team here; our use case pretty much requires
>     bit-for-bit copies, and fortunately this is what SWH is giving us in
>     practice for Git repos, so checkout authentication (for example)
>     should work even when fetching Guix from SWH.

That’s interesting.  I’m sure most of us in the Guix camp are on team
bit-for-bit, but I think we can all agree that it’s not easy to get
there.

> There were other discussions about Guix and Nix and I was pleased to see
> people were enthusiastic about functional package management and about
> our whole endeavor.
>
> Anyway I think we can take this as an opportunity to increase bandwidth
> with the SWH developers!

Good idea.  It’s nice when our efforts and experience produce something
useful to the broader free software community.  :)

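Regarding “sources.json” for past revisions, here is roughly what I have
in mind, as a Python sketch rather than real code: pog_origins_for is a
made-up stand-in for a query against the PoG database, and I only emit
the type/urls/integrity fields of the entries Guix already publishes
(the real sources.json carries a bit more metadata than this).

    import json

    def dump_sources_json(commit, pog_origins_for):
        """Write a minimal sources.json-style file for one past commit.

        `pog_origins_for` is a made-up stand-in for a PoG database
        query; it should yield dicts with "type", "urls" and
        "integrity" keys, like the entries Guix publishes today."""
        sources = [{"type": origin["type"],          # "url", "git", ...
                    "urls": origin["urls"],
                    "integrity": origin["integrity"]}
                   for origin in pog_origins_for(commit)]
        with open("sources-{}.json".format(commit), "w") as port:
            json.dump({"sources": sources}, port, indent=2)
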
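As for the statistics gathering, it boils down to one request per origin
against the public SWH API, which is exactly where the rate limit bites.
Something along these lines, using the /api/1/origin/<url>/get/
endpoint; the back-off logic is just a guess at reasonable behaviour:

    import time
    import requests

    API = "https://archive.softwareheritage.org/api/1/origin/{}/get/"

    def origin_archived(url):
        """Return True if SWH knows `url` as an origin, False otherwise."""
        while True:
            response = requests.get(API.format(url))
            if response.status_code == 429:
                # Rate limited: wait as instructed (or a minute) and retry.
                time.sleep(int(response.headers.get("Retry-After", "60")))
                continue
            return response.status_code == 200   # 404 means not archived
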
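And to make the “known” heuristic concrete, this is the shape of it.  It
is not the real PoG code; the attribute names and the list of extensions
Disarchive handles are only illustrative.

    DISARCHIVE_OK = (".tar.gz", ".tgz", ".tar.xz")   # roughly Disarchive's reach today

    def known(origin):
        """Apply the PoG "known" heuristic to one source origin."""
        if not origin.downloadable:          # upstream copy is already gone
            return False
        if origin.kind == "git":
            return not origin.recursive      # SWH gives us plain Git bit-for-bit
        if origin.kind == "url":
            return origin.url.endswith(DISARCHIVE_OK)
        return False                         # other VCSes, ".tar.bz2", ...
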
--
Tim