On 05-03-2023 at 21:21, Simon Tournier wrote:
Whatever the intrinsic identifier we consider – even ones based on a very weak cryptographic hash function such as MD5, or on a non-cryptographic hash function such as Pearson hashing, etc. – the integrity check is currently done by SHA256.
How about using the hash of the integrity check as an intrinsic identifier, as is done currently? I mean, we hash it with sha256 for the integrity check anyway, so we might as well reuse it.
Maybe ask the GNUnet folks to address by NAR+SHA256 instead in their specification. ;-)
Obviously, Guix should replace NAR+SHA256 by GNUnet FS URIs /j.
Kidding aside, your comment raises two points of view:
1. Guix is fetching data from elsewhere, and this elsewhere is not using the NAR+SHA256 intrinsic identifier. Therefore, the question is how to adapt the source origin to take this elsewhere into account.
2. Replace the NAR+SHA256 integrity checksum by what content-addressed systems use as intrinsic identifier. IMHO, that’s a bad idea for two reasons: (a) security, for instance SHA1 as used by SWH is not secure, and (b) it will be unmanageable in practice.
I was thinking of (1), not (2).
All that’s said, Guix uses extrinsic identifiers for almost all origins, if not all. Even for the ’git-fetch’ method.
For git-fetch, the value of the 'commit' field is intrinsic (except when it's a tag instead).
No, that is imprecise. The exception is *not* a tag label as the value of the ’commit’ field; the exception is a Git commit hash as the value.
Are you referring to the fact that currently, the 'commit' field usually contains a tag name, and that it containing a commit is the exception?
If so, that doesn't contradict my claim.
This can be solved by placing the actual commit in the 'commit' field of git-reference, instead of the tag name; then things are completely unambiguous -- this and its opposite were discussed in ‘On raw strings in <origin> commit field’ (*), IIRC.
The thread you are referencing [1] is based on misunderstandings. I would like to move forward, hence my detailed email. :-)
1: <https://yhetil.org/guix/6e451a878b749d4afb6eede9b476e5faabb0d609.ca...@gmail.com/#r>
Your email is about intrinsic identifiers and more robustness, yet it doesn't mention using git commits more anywhere. As such, I do not follow ‘hence my detailed email’ -- it contains detail, but it misses some relevant detail that I pointed out in my previous response.
Also, with ‘move forward’, do you mean ‘move forward’, or ‘maintain status quo’? Because given that you are replying to the proposed solution (that even avoids problems pointed out in those threads) by saying nothing of technical importance and by pointing to some contentious things, it really appears the latter to me.
(*) Also maybe that thread about tricking peer review. I didn't understand the position that the commit field should contain the (indirect, fragile) tag instead of the (direct, robust) commit, but those differences could be sidestepped by having both a 'tag' field and a 'commit' field, IIUC.
I would not frame it this way. My view is not to replace something by something else; instead, it is to add something and/or several things.
I was thinking of adding the commit (intrinsic) to the git-reference, instead of only having a tag (extrinsic) in the git-reference as is mostly done currently.
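To make the "both a tag and a commit" idea concrete, here is a minimal sketch in Python rather than Guile. It is not Guix's actual git-reference record: the `GitReference` class, its `tag` field and the `resolve_tag` callback are hypothetical names introduced only for illustration. The commit is treated as the authoritative (intrinsic) identifier and the tag as an optional (extrinsic) hint that can be cross-checked against it.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class GitReference:
    """Hypothetical reference carrying both kinds of identifier."""
    url: str
    commit: str                # intrinsic: the commit hash itself
    tag: Optional[str] = None  # extrinsic: a label that upstream may move


def tag_matches_commit(ref: GitReference,
                       resolve_tag: Callable[[str, str], Optional[str]]) -> bool:
    """Return True when the tag, if present, still points at the commit.

    resolve_tag(url, tag) is an assumed callback returning the commit hash
    the tag currently resolves to, or None if the tag no longer exists.
    """
    if ref.tag is None:
        return True  # nothing to cross-check; the commit alone is unambiguous
    return resolve_tag(ref.url, ref.tag) == ref.commit
```

With such a shape, a moved or deleted tag shows up as a mismatch instead of silently changing what gets fetched, while the tag remains available for readability.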
I also want to mention that, except for a general notion of 'more robustness' and a specific command "guix freeze -m manifest.scm" and such, you never mentioned what your view was, so I had to guess.
The problem then was to somehow map the NAR hash to the FS identifier.
Yes, that’s the problem. :-) GNUnet FS identifier is one case. And my discussion here is: could we augment the source origin to be able to deal with various identifiers?
A straightforward solution would be to just replace the https:// by gnunet:// in the origin (like in https://issues.guix.gnu.org/44199, except that patch doesn't support fallbacks to other URLs like url-fetch does).
Somehow, your proposition would be to have a list as URI, right?
  (origin
    (method gnunet-fetch)
    (uri (list (string-append "mirror://gnu/hello/hello-" version ".tar.gz")
               "gnunet://fs/chk/TY48PGS5RVX643NT2B7GDNFCBT4DWG692PF4YNHERR96K6MSFRZ4ZWRPQ4KVKZV29MGRZTWAMY9ETTST4B6VFM47JR2JS5PWBTPVXB0.8A9HRYABJ7HDA7B0"
               "swh:1:dir:9c1eecffa866f7cb9ffdd56c32ad0cecb11fcf2a"))
    (file-name "gnunet-hello-2.10.tar.gz")
    (sha256
     (base32
      "0ssi1wpaf7plaswqqjwigppsg5fyh99vdlb9kzl7c9lng89ndq1i")))
Yes, though in a proper version of 44199 (which doesn't exist yet) it would just be integrated into url-fetch instead of having a separate gnunet-fetch.
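The fallback behaviour under discussion can be sketched as follows, in Python rather than Guile and without claiming anything about how url-fetch or the patch in 44199 is implemented. The function name `fetch_with_fallbacks` and the `fetchers` table are assumptions made for this example. The point it illustrates is that each URI is only a way to locate the data, while the SHA-256 check stays the same whichever network answered.

```python
import hashlib
from typing import Callable, Dict, Iterable, Optional

# A fetcher takes a URI and returns the bytes it found, or None on failure.
Fetcher = Callable[[str], Optional[bytes]]


def fetch_with_fallbacks(uris: Iterable[str],
                         expected_sha256: str,
                         fetchers: Dict[str, Fetcher]) -> bytes:
    """Try each URI in order; accept the first result whose hash matches.

    'fetchers' maps a URI scheme (e.g. "mirror", "gnunet", "swh") to a
    hypothetical function that knows how to retrieve that kind of URI.
    """
    for uri in uris:
        scheme = uri.split(":", 1)[0]
        fetcher = fetchers.get(scheme)
        if fetcher is None:
            continue  # no backend for this kind of identifier; try the next
        try:
            data = fetcher(uri)
        except OSError:
            continue  # network or I/O failure; fall back to the next URI
        # The integrity check does not depend on where the data came from.
        if data is not None and hashlib.sha256(data).hexdigest() == expected_sha256:
            return data
    raise LookupError("no URI yielded content matching the expected SHA-256")
```

Note that the sketch hashes raw bytes and compares hex digests, whereas Guix records its hashes in its own base32 encoding; only the control flow is meant to carry over.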
It is neither affordable nor wanted to switch from the current extrinsic identification to a completely intrinsic one. Although it would fix many issues. ;-)
How about something in between: include both an intrinsic identifier (the sha256sum) and an extrinsic identifier (the URLs to locate the object at), like the status quo.
That’s what I am proposing between the lines. :-)
I recommend being explicit.
The question is which design. For instance, it could go under the field ’properties’, similarly to “upstream name” or potentially other “metadata”. Or it could go under the source origin field. However, as you pointed out, a ’properties’ entry would not be as easy. And as you also pointed out, the integrity field could be something other than ’sha256’, so maybe we could have a list here.
To be clear, my comment on Guix supporting other things than sha256 was just a statement of fact, not a proposal to use that mechanism (and neither a proposal to not use that mechanism).
The discussion could also fit how to distribute using ERIS.
ERIS is not a method on its own; you need to combine it with a P2P network that uses ERIS. I do not understand the special focus on ERIS.
Yes, indeed. However, to my knowledge, each P2P network can use its own identifier and, from my understanding, ERIS relies on whatever P2P network. Therefore, wanting guix-daemon to be able to use ERIS somehow implies a discussion about the identifiers used by the P2P networks. Am I missing something?
I don't have any issue with ERIS itself (*). The issue I have with ERIS, is that it often appears to be treated as some panacea that transcends all P2P systems and is fundamentally different from other identifiers used by other P2P systems, but <https://xkcd.com/927/> applies here -- while it might become some universal standard, it isn't yet.
Hence, ‘I do not understand the __special__ focus on ERIS’ (emphasis added). As long as the ERIS identifier is treated as one among many instead of somehow being considered special, it's fine to me.
(*) Besides several technical issues in its current implementation -- the implementation of ERIS is optimised for classical transports instead of P2P transports, ERIS is only implemented for IPFS currently, and ERIS doesn't have a deduplication system for directories. (In GNUnet, and I think in IPFS and BitTorrent too, if two directories (e.g. store items) that have a file in common were put into the P2P network, then for the network's purposes these two files are the same file, so availability of one store item aids the availability of another store item.)
At some point, I was thinking of having something like “guix freeze -m manifest.scm” returning a map of all the sources, from the deep bootstrap to the leaf packages described in manifest.scm. However, maybe something is poor in the metadata we collect at package time.
That sounds like "guix build --sources=transitive" to me, except for being even more transitive. I propose making this an additional option for the --sources argument instead.
No. “guix build --sources=transitive” returns an archive containing all the sources. Instead, I would like all the various identifiers (URL, NAR, SWHID, GNUnet, etc.) of all the transitive sources.
I do not see how making a list of all identifiers helps with robustness -- you need the object the identifiers point to, not the identifier itself.
Unless the goal is to use the map of package->identifiers to determine which packages are currently lacking redundancy (i.e., have few identifiers), which to be clear seems reasonable to me.
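That use of the map can be sketched in a few lines of Python. The shape of the map (package name to list of identifier strings) and the function name `lacking_redundancy` are assumptions for illustration; no such command or output format exists in Guix as far as this thread shows.

```python
from typing import Dict, List


def lacking_redundancy(identifiers: Dict[str, List[str]],
                       minimum: int = 2) -> List[str]:
    """Return the packages that have fewer than 'minimum' distinct identifiers.

    'identifiers' maps a package name to every identifier known for its
    source (URLs, SWHIDs, GNUnet URIs, ...). Duplicates count only once.
    """
    return sorted(name for name, ids in identifiers.items()
                  if len(set(ids)) < minimum)


# Illustrative data only; the entries are made up for this example.
sources = {
    "hello": ["mirror://gnu/hello/hello-2.12.1.tar.gz",
              "swh:1:dir:9c1eecffa866f7cb9ffdd56c32ad0cecb11fcf2a"],
    "example-package": ["https://example.org/example-1.0.tar.gz"],
}
weakly_covered = lacking_redundancy(sources)
```

Here `weakly_covered` lists only "example-package", the one source that can be located in a single way.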
Cheers,
simon
PS:
However the fields ’swhid’ and the other SHA256 ’digest’ are different from above. That’s because of the dots [...] part. It probably comes from the normalization process. Well, I am not sure I deeply understand why it is different, but that’s another story. :-)
The reason for the normalisation was something about SWH only providing tarballs whose contents are equal to the ingested tarball; the tarballs are not bit-for-bit identical to the ingested tarball. But Guix needs bit-for-bit identical tarballs, so Disarchive contains the information that was stripped-out by SWH to complement the tarballs provided by Disarchive.
SWH is not in the picture with the example I provided. :-) Yes, the dots part is related to some normalization and “metadata”.
Your question was about where the differences come from. The answer is ‘because SWH normalisation stuff’. As such, SWH is in the picture.
What I do not understand is that, if the output of “guix build hello -S” is manually uncompressed and untarred, the content corresponds to:
  $ guix hash -S git -H sha256 -f hex hello-2.12.1
  cc7d5c45cfa1f5fba96c8b32d933734b24377a3c1ac776650044e497469affd4
The tool ’disarchive’ disassembles the compressed archive: it first provides the hash of the compressed archive (.tar.gz), then stores metadata about compression level, algorithm, etc., then provides the hash of the uncompressed archive (.tar), then stores metadata about files, and last it provides the hash of the tree. It reads,
  (input
    (directory-ref
      (version 0)
      (name "3dq55rw99wdc4g4wblz7xikc8a2jy7a3-hello-2.12.1")
      (addresses
        (swhid "swh:1:dir:9c1eecffa866f7cb9ffdd56c32ad0cecb11fcf2a"))
      (digest
        (sha256 "1cb6effd40736b441a2a6dd49e56b3dfd4f6550e8ae1a8ac34ed4b1674097bc0")))))))))
and I do not understand why it is not the same as manually computed; see above. Well, that’s a detail and not relevant to the current discussion since it is part of how Disarchive works internally.
You are hashing the 'hello-2.12.1' directory, which is the only directory in the tarball. However, while it is considered bad practice, a tarball can contain multiple top-level entries. As such, you should consider the tarball as an encoding of a directory that happens to contain the 'hello-2.12.1' directory, and hash the wrapper directory instead of its member hello-2.12.1:
  $ mkdir a
  $ cd a
  $ tar -xf /gnu/store/3dq55rw99wdc4g4wblz7xikc8a2jy7a3-hello-2.12.1.tar.gz
  $ guix hash -Sgit -H sha256 -f hex .
  1cb6effd40736b441a2a6dd49e56b3dfd4f6550e8ae1a8ac34ed4b1674097bc0
Using these steps, the value in the (digest (sha256 ...)) is recovered.
Greetings,
Maxime.