Hi Timothy,

On Thu, 03 Feb 2022 at 10:46, Timothy Sample <samp...@ngyro.com> wrote:
>> But the question is whether Disarchive disassembles and preserves
>> external patches. Timothy?

[...]

> The bad news is that 0.75 is not there. At first I was going to
> apologize for the shortcomings of the sampling approach... until I
> realized you are trying to trick me! ;) Unless I’m misreading the Git
> history, that patch appeared and disappeared on core-updates and was
> never part of master.

Because of the good news, the same could be done for these patches, no?
For instance, one missing patch, as Maxime pointed out, is there:

https://github.com/archlinux/svntogit-packages/blob/155510dd18d2f290085f40d2a95a3701db4a224d/texlive-bin/repos/extra-x86_64/pdftex-poppler0.75.patch

And SWH contains it:

https://archive.softwareheritage.org/browse/revision/155510dd18d2f290085f40d2a95a3701db4a224d/?path=texlive-bin/repos/extra-x86_64/pdftex-poppler0.75.patch

Therefore, somehow, the “only” missing step is to disassemble this data
and add an entry to the database, no?

I am not sure what you mean by «was never part of master». After the
merge, what was core-updates and what was master is somehow
indistinguishable, no? Or are you walking only the first parent after a
merge commit, i.e., “git log --first-parent”? Well, Git history and
ordering quickly lead to a headache, as the git-log documentation
shows. :-) I think it is fine to simplify a “complex” history by
sampling and considering only a first-parent walk.

> The way the PoG script tracks down sources is pretty robust. It takes
> the derivation graph to be canonical, and only uses the graph of
> high-level objects (packages, origins, etc.) for extra info. I do my
> best to follow the links of the high-level objects, and then verify that
> I did a good job by lowering them and checking coverage against the set
> of derivations found by following the derivation graph. Since the
> derivation graph necessarily contains everything that matters, this is a
> good way to track down all the high-level objects that matter. See
> <https://git.ngyro.com/preservation-of-guix/tree/pog-inferior.scm#n113>
> for a rather scary looking procedure that finds the edges of the
> high-level object graph.

Cool! Thanks for explaining and pointing out how PoG does it.

> That being said, coverage is not perfect. The most obvious problem (to
> me) is the sampling approach. Surely there are sources that are missed
> by only examining one commit per week. This can be checked and fixed by
> using data from the Guix Data Service, which has data from essentially
> every Guix commit.

No, the Data Service and even Cuirass use a sampling approach too; they
do not process all the commits. Cuirass uses an «every 5 minutes»
approach; CI-savvy people, please correct me if I am mistaken. The Data
Service uses a «batch guix-commits» approach; more details in this
thread [1].

Well, the coverage is twofold, IMHO:

 1. preserve what is currently entering Guix;
 2. archive what was previously available in Guix.

About #1, the main mechanisms are sources.json, “guix lint”, and
updating disarchive-db (now done by CI). Whatever is missed should then
be fixed by #2.

About #2, it is hard to fix all the issues at once. One commit per week
already provides a good view for spotting some problems. Processing all
the commits just means burning more CPU; it seems “easy” once the
infrastructure is in place, no?

1: <https://yhetil.org/guix/863617oe1h....@gmail.com/>

Cheers,
simon
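
P.S. To check my own understanding of the derivation-graph part: is the
core of it something like the following (untested) sketch? It only uses
the public (guix derivations) API; the procedure name is mine, and the
real pog-inferior.scm obviously does much more.

(use-modules (guix derivations)
             (ice-9 match))

;; Walk DRV's input graph and collect the fixed-output derivations,
;; i.e. the downloaded sources (tarballs, patches, ...) that need
;; preserving.
(define (source-derivations drv)
  (let loop ((todo (list drv))
             (seen '())
             (sources '()))
    (match todo
      (() sources)
      ((drv . rest)
       (let ((file (derivation-file-name drv)))
         (if (member file seen)
             (loop rest seen sources)
             (loop (append (map derivation-input-derivation
                                (derivation-inputs drv))
                           rest)
                   (cons file seen)
                   (if (fixed-output-derivation? drv)
                       (cons drv sources)
                       sources))))))))

If that is roughly right, then the coverage question reduces to: which
commits do we pick to build such graphs from?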