On 02/08/2013 06:49 AM, Stefan Hajnoczi wrote:
> On Wed, Feb 06, 2013 at 01:31:33PM +0100, Benoît Canet wrote:
>> This patchset create the core infrastructure for deduplication and enable it.
>
> Here is a high-level overview of qcow2 dedup for others reviewing this
> series:
Awesome.  Benoît, how much of this information should be added to either
the documentation or the commit messages of the various parts of your series?

> Data structures
> ---------------
>
> Deduplication adds one new on-disk structure: the dedup hash table.  Each
> entry contains:
>  * Hash value of a data cluster (32 bytes)
>  * Logical offset of the first cluster with these contents (8 bytes)
>
> Unallocated clusters have no hash value so the dedup hash table uses an L1
> table to allow sparse allocation of dedup hash table entries.  Dedup hash
> table blocks are accessed through dedup_cluster_cache, a cache of the L2
> blocks.
>
> The dedup hash table is indexed by physical offset, in other words it can
> be used to look up the hash for a given offset into the image file.  This
> means it cannot be used to find duplicate clusters.

Or rather, finding whether any other cluster has the same hash would
require an expensive O(n) linear scan if only the on-disk table is used.

> This is solved by building an in-memory lookup tree when the file is
> opened.  The lookup tree maps hashes to the physical offset and first
> logical offset - it is an inverse mapping.
>
> Since the dedup hash table is already scanned on startup, the forward
> mapping (physical offset to hash) is also loaded into an in-memory lookup
> tree.
>
> Finally we have arrived at the two trees used for deduplication:
> 1. dedup_tree_by_hash: hash -> <physical offset, first logical offset>
> 2. dedup_tree_by_sect: physical offset -> <hash, first logical offset>
>
> dedup_tree_by_sect is the in-memory version of the dedup hash table on
> disk.
>
> Read operation
> --------------
>
> Reads are unaffected by dedup.
>
> Write operation
> ---------------
>
> Writes must be cluster-aligned/cluster-sized or else they incur extra
> reads to load the untouched sectors (similar to copy-on-write).  This is
> necessary because the hash is calculated over the entire cluster, so we
> always work in cluster units.
>
> When a duplicate is found the refcount of the existing cluster is
> incremented.  Since refcounts are 16-bit there is a risk of overflowing,
> for example with an all-zero disk image.  A threshold is set for breaking
> deduplication and creating a new cluster for the sake of preventing
> refcount overflow.

You probably also want to mention what happens when one of the clusters
that previously mapped to a common hash is then modified, because that
requires updating the lookup tables to reflect the new hash and a possible
change in the first logical offset associated with the old hash.
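For anyone trying to picture how the pieces fit together, here is a rough,
self-contained C sketch of the lookup-and-refcount logic described above.
It is not code from the series: only the tree names (dedup_tree_by_hash,
dedup_tree_by_sect) and the 32-byte hash / 16-bit refcount sizes come from
Stefan's summary.  The use of GLib GTree, SHA-256 as the hash, the in-node
refcount (real qcow2 keeps refcounts in refcount blocks), the threshold
value, and every helper name are assumptions made only to keep the example
compilable.

/*
 * Hedged sketch only -- not code from the dedup series.  The tree names
 * follow Stefan's description; everything else is invented for this
 * example.
 */
#include <glib.h>
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define HASH_LEN           32       /* 32-byte hash per on-disk entry */
#define REFCOUNT_DEDUP_MAX 0xff00   /* assumed threshold, below 2^16 - 1 */

typedef struct QCowHashNode {
    uint8_t  hash[HASH_LEN];        /* hash of the cluster contents */
    uint64_t phys_offset;           /* where the cluster lives in the file */
    uint64_t first_logical_offset;  /* first logical cluster with this data */
    uint16_t refcount;              /* real qcow2 keeps this in refcount blocks */
} QCowHashNode;

typedef struct QCowDedupState {
    GTree *dedup_tree_by_hash;      /* hash -> node (inverse mapping) */
    GTree *dedup_tree_by_sect;      /* phys offset -> node (mirrors on-disk table) */
} QCowDedupState;

static gint cmp_hash(gconstpointer a, gconstpointer b)
{
    return memcmp(a, b, HASH_LEN);
}

static gint cmp_offset(gconstpointer a, gconstpointer b)
{
    const uint64_t *x = a, *y = b;
    return *x < *y ? -1 : *x > *y;
}

static QCowDedupState *dedup_state_new(void)
{
    QCowDedupState *s = g_new0(QCowDedupState, 1);
    s->dedup_tree_by_hash = g_tree_new(cmp_hash);
    s->dedup_tree_by_sect = g_tree_new(cmp_offset);
    return s;
}

/* Hash a whole cluster; this is why writes must be cluster-sized. */
static void hash_cluster(const uint8_t *data, size_t cluster_size,
                         uint8_t hash[HASH_LEN])
{
    GChecksum *cs = g_checksum_new(G_CHECKSUM_SHA256);
    gsize len = HASH_LEN;

    g_checksum_update(cs, data, cluster_size);
    g_checksum_get_digest(cs, hash, &len);
    g_checksum_free(cs);
}

/*
 * Write path: return the physical offset the L2 entry should point at.
 * alloc_cluster() is a callback standing in for qcow2's real cluster
 * allocation; *deduped is set when an existing cluster was reused.
 */
static uint64_t dedup_or_allocate(QCowDedupState *s,
                                  const uint8_t *data, size_t cluster_size,
                                  uint64_t logical_offset,
                                  uint64_t (*alloc_cluster)(const uint8_t *, size_t),
                                  bool *deduped)
{
    uint8_t hash[HASH_LEN];
    QCowHashNode *node;

    hash_cluster(data, cluster_size, hash);

    node = g_tree_lookup(s->dedup_tree_by_hash, hash);
    if (node && node->refcount < REFCOUNT_DEDUP_MAX) {
        /* Duplicate found and the refcount is still safely below the
         * overflow threshold: reuse the existing cluster. */
        node->refcount++;
        *deduped = true;
        return node->phys_offset;
    }

    /* No duplicate (or threshold hit): allocate and index a new cluster. */
    node = g_new0(QCowHashNode, 1);
    memcpy(node->hash, hash, HASH_LEN);
    node->phys_offset = alloc_cluster(data, cluster_size);
    node->first_logical_offset = logical_offset;
    node->refcount = 1;

    g_tree_insert(s->dedup_tree_by_hash, node->hash, node);
    g_tree_insert(s->dedup_tree_by_sect, &node->phys_offset, node);

    *deduped = false;
    return node->phys_offset;
}

The sketch also shows why the by-hash tree is needed at all: the on-disk
table alone only answers "what is the hash at this physical offset", not
"where is a cluster with this hash".  A rewrite of an already-deduplicated
logical cluster would additionally need the step mentioned above: adjust or
drop the entries for the old hash in both trees before inserting the new
one.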
-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org