On Wed, Jan 9, 2013 at 4:24 PM, Benoît Canet <benoit.ca...@irqsave.net> wrote:
> Here is a mail to open a discussion on QCOW2 deduplication design and
> performance.
>
> The current deduplication strategy is RAM based.
> One of the goals of the project is to plan and implement an alternative
> way to do the lookups from disk for bigger images.
>
> In a first section I will enumerate the disk overheads of the RAM-based
> lookup strategy, and in a second section the additional costs of doing
> lookups in a disk-based prefix b-tree.
>
> Comments and suggestions are welcome.
>
> I) RAM-based lookup overheads
>
> The qcow2 read path is not modified by the deduplication patchset.
>
> Each cluster written gets its hash computed.
>
> Two GTrees are used to give access to the hashes: one indexed by hash and
> the other indexed by physical offset.
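
Just to check that I am reading the in-RAM structures right, here is a
minimal GLib sketch of the two indexes.  The struct, field, and function
names are my guesses, not the actual patchset code, and I assume a 256-bit
hash over 4KB clusters.

#include <glib.h>
#include <stdint.h>
#include <string.h>

#define HASH_LEN 32    /* assumed 256-bit hash of the 4KB cluster data */

typedef struct QCowHashNode {
    uint8_t  hash[HASH_LEN];   /* hash of the cluster contents */
    uint64_t physical_sect;    /* physical sector of the cluster on disk */
} QCowHashNode;

static gint hash_cmp(gconstpointer a, gconstpointer b)
{
    return memcmp(a, b, HASH_LEN);
}

static gint sect_cmp(gconstpointer a, gconstpointer b)
{
    const uint64_t *x = a, *y = b;
    return (*x > *y) - (*x < *y);
}

/* one tree indexed by hash, one indexed by physical offset */
static GTree *dedup_tree_by_hash;
static GTree *dedup_tree_by_sect;

static void dedup_trees_init(void)
{
    dedup_tree_by_hash = g_tree_new(hash_cmp);
    dedup_tree_by_sect = g_tree_new(sect_cmp);
}

static void dedup_trees_insert(QCowHashNode *node)
{
    /* the same node is reachable from both indexes */
    g_tree_insert(dedup_tree_by_hash, node->hash, node);
    g_tree_insert(dedup_tree_by_sect, &node->physical_sect, node);
}
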
What is the GTree indexed by physical offset used for?

> I.0) unaligned write
>
> When a write is unaligned or smaller than a 4KB cluster, the deduplication
> code issues one or two reads to get the missing data required to build a
> 4KB*n linear buffer.
> The deduplication metrics code shows that this situation doesn't happen
> with virtio and ext3 as the guest file system.

If the application uses O_DIRECT inside the guest you may see <4 KB
requests even on ext3 guest file systems.  But in the buffered I/O case
the file system will use 4 KB blocks or similar.

> I.1) First write overhead
>
> The hash is computed.
>
> The cluster is not duplicated, so the hash is stored in a linked list.
>
> After that the writev call gets a new 64KB L2 dedup hash block
> corresponding to the physical sector of the written cluster.
> (This can be an allocating write, requiring the offset of the new block to
> be written in the dedup table and flushed.)
>
> The hash is written in the L2 dedup hash block and flushed later by the
> dedup_block_cache.
>
> I.2) Same cluster rewrite at the same place
>
> The hash is computed.
>
> qcow2_get_cluster_offset is called and the result is used to check that it
> is a rewrite.
>
> The cluster is counted as duplicated and not rewritten on disk.

This case is when identical data is rewritten in place?  No writes are
required - this is the scenario where online dedup is faster than non-dedup
because we avoid I/O entirely.  (A rough sketch of how I read this
write-path decision is appended at the end of this mail.)

> I.3) First duplicated cluster write
>
> The hash is computed.
>
> qcow2_get_cluster_offset is called and we see that we are not rewriting
> the same cluster at the same place.
>
> I.3.a) The L2 entry of the first cluster written with this hash is
> overwritten to remove the QCOW_OFLAG_COPIED flag.
>
> I.3.b) The dedup hash block of the hash is overwritten to remember at the
> next startup that QCOW_OFLAG_COPIED has been cleared.
>
> A new L2 entry is created for this logical sector pointing to the physical
> cluster. (potential allocating write)
>
> The refcount of the physical cluster is updated.
>
> I.4) Further writes of duplicated clusters
>
> Same as I.3 without I.3.a and I.3.b.
>
> I.5) cluster removal
> When an L2 entry to a cluster becomes stale, the qcow2 code decrements the
> refcount.
> When the refcount reaches zero, the L2 hash block of the stale cluster is
> written to clear the hash.
> This happens often and requires the second GTree to find the hash by its
> physical sector number.

This happens often?  I'm surprised.  I thought this only happens when you
delete snapshots or resize the image file?  Maybe I misunderstood this
case.

> I.6) max refcount reached
> The L2 hash block of the cluster is written in order to remember at the
> next startup that it must not be used anymore for deduplication.  The hash
> is dropped from the GTrees.

Interesting case.  This means you can no longer take snapshots containing
this cluster because we cannot track references :(.

Worst case: the guest fills the disk with the same 4 KB data (e.g. zeroes).
There is only a single data cluster but the refcount is maxed out.  Now it
is not possible to take a snapshot.

Stefan
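
For reference, here is the rough sketch of the write-path decision I
mentioned above.  It reuses the QCowHashNode and GTree definitions from the
earlier sketch; the enum and function names are my guesses, not the actual
patchset code.

typedef enum {
    DEDUP_STORE_NEW_HASH,   /* I.1: hash unseen, store it and write data  */
    DEDUP_SKIP_REWRITE,     /* I.2: identical rewrite in place, no I/O    */
    DEDUP_REF_EXISTING      /* I.3: point the L2 entry at the existing
                               cluster and increment its refcount         */
} DedupAction;

static DedupAction dedup_classify_write(const uint8_t *hash,
                                        uint64_t current_physical_sect)
{
    /* current_physical_sect is what qcow2_get_cluster_offset() returned
     * for the logical sector being written */
    QCowHashNode *node = g_tree_lookup(dedup_tree_by_hash, hash);

    if (!node) {
        return DEDUP_STORE_NEW_HASH;
    }
    if (node->physical_sect == current_physical_sect) {
        return DEDUP_SKIP_REWRITE;
    }
    return DEDUP_REF_EXISTING;
}

And, if I understand I.5 correctly, the removal path is where the second
tree (indexed by physical offset) comes in:

/* I.5: the cluster's refcount reached zero, drop its hash from both
 * in-RAM indexes (the on-disk hash block is cleared separately) */
static void dedup_drop_cluster(uint64_t physical_sect)
{
    QCowHashNode *node = g_tree_lookup(dedup_tree_by_sect, &physical_sect);

    if (node) {
        g_tree_remove(dedup_tree_by_hash, node->hash);
        g_tree_remove(dedup_tree_by_sect, &node->physical_sect);
        g_free(node);
    }
}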