On Mon, Feb 27, 2023 at 01:56:26PM +0000, Richard W.M. Jones wrote:
>
> https://github.com/kubevirt/containerized-data-importer/issues/1520
>
> Hi Eric,
>
> We had a question from the Kubevirt team related to the above issue.
> The question is roughly if it's possible to calculate the checksum of
> an image as an nbdkit filter and/or in the qemu block layer.

In the qemu block layer - yes: see Nir's https://gitlab.com/nirs/blkhash

Note that there is a huge difference between a block-based checksum (a
checksum of the block data the guest will see) and a checksum of the
original file (bytes as visible on the source, although with non-raw
files, more than one image may hash to the same guest-visible contents
despite having different host checksums).

Also, it may prove to be more efficient to generate a Merkle Tree hash
of an image (an image is divided into smaller portions in a
binary-tree fanout, where the hash of the entire image is computed by
combining hashes of child nodes up to the root of the tree - which
allows downloading blocks out of order).  [You may be more familiar
with Merkle Trees than you realize - every git commit id is ultimately
a Merkle Tree hash of all prior commits]

As for nbdkit being able to do hashing as a filter, we don't have such
a filter now, but I think it would be technically possible to
implement one.  The trickiest part would be figuring out a way to
expose the checksum to the client once the client has finally read
through the entire image.  It would be easy to have nbdkit output the
resulting hash in a secondary file for consumption by the end client;
harder, but potentially more useful, would be extending the NBD
protocol itself to allow the NBD client to issue a query to the server
to provide the hash directly (or an indication that the hash is not
yet known because not all blocks have been hashed yet).

>
> Supplemental #1: could qemu-img convert calculate a checksum as it goes
> along?

Nir's work on blkhash suggests that this is doable.

>
> Supplemental #2: could we detect various sorts of common errors, such
> as a webserver that is incorrectly configured and serves up an error page
> containing "<html>"; or something which is supposed to be a disk image
> but does not "look like" (in some ill-defined sense) a disk image,
> eg. it has no partition table.
>
> I'm not sure if qemu has any existing features covering the above (and
> I know for sure that nbdkit doesn't).

Indeed.  But adding a filter that does a pre-read of the plugin's
first 1M during .prepare to look for an expected signature (what is
sufficient, seeing if there is a partition table?) and refuses to let
the client connect if the plugin is serving wrong data seems fairly
straightforward.

>
> One issue is that calculating a checksum involves a linear scan of the
> image, although we can at least skip holes.

Or intentionally choose a hash that can be computed out-of-order, such
as a Merkle Tree.  But we'd need a standard setup for all parties to
agree on how the hash is to be computed and checked, if it is going to
be anything more than just a linear hash of the entire guest-visible
contents.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

_______________________________________________
Libguestfs mailing list
Libguestfs@redhat.com
https://listman.redhat.com/mailman/listinfo/libguestfs
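
To make the out-of-order idea above a bit more concrete, here is a
minimal Python sketch of a Merkle Tree hash over fixed-size blocks.
Everything in it is an assumption for illustration only: the 64 KiB
block size, the choice of SHA-256, the 0x00/0x01 prefixes that keep
leaves and interior nodes distinct, the rule of promoting an unpaired
node, and all of the helper names are hypothetical.  It is not an
agreed format, and it is not a description of how blkhash computes
its result.

```python
import hashlib

BLOCK_SIZE = 64 * 1024  # hypothetical; all parties would have to agree on it


def leaf_hash(block):
    # Prefix 0x00 so a leaf digest can never collide with an interior node.
    return hashlib.sha256(b"\x00" + block).digest()


def node_hash(left, right):
    # Prefix 0x01 marks an interior node built from two child digests.
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(leaves):
    """Combine leaf digests pairwise, level by level, up to a single root."""
    if not leaves:
        return hashlib.sha256(b"").digest()
    level = list(leaves)
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                parents.append(node_hash(level[i], level[i + 1]))
            else:
                parents.append(level[i])  # unpaired node is promoted as-is
        level = parents
    return level[0]


def split_blocks(image, block_size=BLOCK_SIZE):
    return [image[i:i + block_size] for i in range(0, len(image), block_size)]


def hash_out_of_order(image, order, block_size=BLOCK_SIZE):
    """Hash blocks in the order they 'arrive', then build the tree.

    Each leaf digest is stored at its block index, so the arrival order
    does not affect the root.
    """
    blocks = split_blocks(image, block_size)
    leaves = [None] * len(blocks)
    for idx in order:
        leaves[idx] = leaf_hash(blocks[idx])
    if any(leaf is None for leaf in leaves):
        raise ValueError("not all blocks have been hashed yet")
    return merkle_root(leaves)


if __name__ == "__main__":
    image = bytes(range(256)) * 1024  # 256 KiB of sample data, 4 blocks
    sequential = hash_out_of_order(image, [0, 1, 2, 3])
    shuffled = hash_out_of_order(image, [2, 0, 3, 1])
    print(sequential.hex())
    print(sequential == shuffled)
```

The ValueError case corresponds to the "hash is not yet known because
not all blocks have been hashed yet" situation mentioned earlier; a
real design would also need to settle how holes and the final short
block are treated.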