Hello,

Here is a mail to open a discussion on the QCOW2 deduplication design and performance.
The current deduplication strategy is RAM based. One of the goals of the project is to plan and implement an alternative way of doing the lookups from disk for bigger images. In a first section I will enumerate the disk overheads of the RAM based lookup strategy, then in a second section the additional costs of doing the lookups in a disk based prefix b-tree. Comments and suggestions are welcome.

I) RAM based lookups overhead

The qcow2 read path is not modified by the deduplication patchset.
Each cluster written gets its hash computed.
Two GTrees are used to give access to the hashes: one indexed by hash and the other indexed by physical offset. (A minimal sketch of this indexing is given after I.6.)

I.0) unaligned write

When a write is unaligned or smaller than a 4KB cluster, the deduplication code issues one or two reads to get the missing data required to build a 4KB*n linear buffer.
The deduplication metrics code shows that this situation does not happen with virtio and ext3 as the guest filesystem.

I.1) First write overhead

The hash is computed; the cluster is not duplicated, so the hash is stored in a linked list.
After that the writev call gets a new 64KB L2 dedup hash block corresponding to the physical sector of the written cluster. (This can be an allocating write, requiring to write the offset of the new block in the dedup table and flush.)
The hash is written in the L2 dedup hash block and flushed later by the dedup_block_cache.

I.2) Same cluster rewrite at the same place

The hash is computed, then qcow2_get_cluster_offset is called and the result is used to check that it is a rewrite.
The cluster is counted as duplicated and not rewritten on disk.

I.3) First duplicated cluster write

The hash is computed, qcow2_get_cluster_offset is called and we see that we are not rewriting the same cluster at the same place.
I.3.a) The L2 entry of the first cluster written with this hash is overwritten to remove the QCOW_OFLAG_COPIED flag.
I.3.b) The dedup hash block of the hash is overwritten to remember at the next startup that QCOW_OFLAG_COPIED has been cleared.
A new L2 entry is created for this logical sector, pointing to the physical cluster. (potential allocating write)
The refcount of the physical cluster is updated.

I.4) Duplicated clusters further writes

Same as I.3, without the I.3.a and I.3.b steps.

I.5) cluster removal

When an L2 entry to a cluster becomes stale, the qcow2 code decrements the refcount.
When the refcount reaches zero, the L2 hash block of the stale cluster is written to clear the hash.
This happens often and requires the second GTree to find the hash by its physical sector number.

I.6) max refcount reached

The L2 hash block of the cluster is written in order to remember at the next startup that it must not be used anymore for deduplication.
The hash is dropped from the GTrees.
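To make the RAM based bookkeeping above more concrete, here is a minimal sketch of how the two GTrees could be tied together. The structure and function names (DedupHashNode, dedup_insert, ...) and the 32 byte hash size are illustrative assumptions, not the actual patchset code; only the GLib GTree calls are real API.

#include <glib.h>
#include <stdint.h>
#include <string.h>

#define HASH_LEN 32                /* assumed hash size in bytes */

typedef struct DedupHashNode {
    uint8_t  hash[HASH_LEN];       /* hash of the 4KB cluster content */
    uint64_t phys_offset;          /* physical offset of the first cluster
                                      written with this hash */
} DedupHashNode;

/* key compare for the tree indexed by hash */
static gint compare_by_hash(gconstpointer a, gconstpointer b, gpointer opaque)
{
    return memcmp(a, b, HASH_LEN);
}

/* key compare for the tree indexed by physical offset */
static gint compare_by_offset(gconstpointer a, gconstpointer b, gpointer opaque)
{
    uint64_t oa = *(const uint64_t *) a, ob = *(const uint64_t *) b;
    return oa < ob ? -1 : (oa > ob ? 1 : 0);
}

static GTree *tree_by_hash;    /* hash -> node, used by the write path */
static GTree *tree_by_offset;  /* physical offset -> node, used for I.5 */

static void dedup_init(void)
{
    tree_by_hash   = g_tree_new_full(compare_by_hash, NULL, NULL, g_free);
    tree_by_offset = g_tree_new_full(compare_by_offset, NULL, NULL, NULL);
}

/* I.1: first write of a cluster with this hash */
static void dedup_insert(const uint8_t *hash, uint64_t phys_offset)
{
    DedupHashNode *node = g_new0(DedupHashNode, 1);
    memcpy(node->hash, hash, HASH_LEN);
    node->phys_offset = phys_offset;
    g_tree_insert(tree_by_hash, node->hash, node);
    g_tree_insert(tree_by_offset, &node->phys_offset, node);
}

/* I.2/I.3: is this cluster content already known? */
static DedupHashNode *dedup_lookup(const uint8_t *hash)
{
    return g_tree_lookup(tree_by_hash, hash);
}

/* I.5: the refcount of a physical cluster dropped to zero,
 * so drop its now stale hash without re-reading the cluster */
static void dedup_remove_by_offset(uint64_t phys_offset)
{
    DedupHashNode *node = g_tree_lookup(tree_by_offset, &phys_offset);
    if (node) {
        g_tree_remove(tree_by_offset, &phys_offset);
        g_tree_remove(tree_by_hash, node->hash); /* frees node via g_free */
    }
}

The second tree is what makes I.5 cheap in RAM: the stale hash is found by physical sector without re-reading and re-hashing the cluster. The disk based design below needs an equivalent of this by-offset index, and that is where most of the open questions are.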
II) Disk based lookup additional overhead

My initial idea is to replace the RAM based GTree indexed by hash with a prefix b-tree stored on disk.
ict.pue.udlap.mx/people/carlos/is215/papers/p11-bayer.pdf
As the hashes are mostly random, the prefix tree should work well.
An additional data structure would be needed to do the lookups by physical offset; this additional data structure is clearly a bottleneck in this design.
Each lookup will cost O(log n) disk I/Os (a rough sketch of such a lookup is in the PS at the end of this mail). Insertion and deletion can cost about twice as many I/Os (leaves and nodes need to be split and merged).

II.1) First write overhead

An O(log n) lookup in the b-tree to find the hash.
Disk lookups by physical sector to remove the stale hash node (what structure to use for this?).
When the hash is written, the leaf of the failed lookup can be reused.
If the leaf is full, splitting the leaf and restructuring the tree must be done in O(log n).
Update of a not yet defined structure used to look up b-tree leaves by their offset on disk. (potential allocating write)

II.2) Same cluster rewrite at the same place

Lookup in the b-tree to find the hash.

II.3) First duplicated cluster write

Lookup in the b-tree to find the hash.
The leaf of the b-tree must be overwritten to remember that we have cleared QCOW_OFLAG_COPIED.

II.4) Duplicated cluster further writes

Lookup in the b-tree to find the hash.

II.5) cluster removal

A query to find the hash b-tree leaf by sector must be done on a not yet defined data structure.
The hash must then be removed from the b-tree leaf.
Two methods could be used to bypass the need for this additional data structure:
-Read the cluster, recompute its hash and use that hash for the removal.
-Store hashes and clusters forever, setting their refcount to a special value meaning "reserved for future reuse".

II.6) max refcount reached

The RAM implementation just drops the hash corresponding to the written cluster and marks on disk that this hash must not be reloaded; a duplicate hash is then created.
I have not yet found how to handle this with a b-tree, as it is not supposed to contain duplicate keys.

Regards,

Benoît
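PS: to make the cost model of II) concrete, here is a rough sketch of the O(log n) lookup on a disk based b-tree, assuming 4KB nodes that store full 32 byte hashes (an actual prefix b-tree, as in the paper referenced above, would store shortened separator keys in the inner nodes). All names and the node layout are made up for illustration; none of this exists in the patchset yet.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>   /* pread */

#define HASH_LEN   32
#define NODE_SIZE  4096

/* hypothetical on-disk node: a 4KB block holding sorted entries */
typedef struct BtreeEntry {
    uint8_t  key[HASH_LEN];   /* full hash; a prefix b-tree would shorten
                                 the keys stored in inner nodes */
    uint64_t value;           /* child node offset, or cluster offset in a leaf */
} BtreeEntry;

typedef struct BtreeNode {
    uint32_t   is_leaf;
    uint32_t   nb_entries;
    BtreeEntry entries[(NODE_SIZE - 8) / sizeof(BtreeEntry)];
} BtreeNode;

/* Walk the tree from the root: one pread() per level, hence O(log n)
 * disk I/Os per lookup. In an inner node, entries[i].key is assumed to be
 * the largest hash stored under child i. Returns true and fills
 * *cluster_offset when the hash is found. */
static bool btree_lookup(int fd, uint64_t root_offset,
                         const uint8_t *hash, uint64_t *cluster_offset)
{
    BtreeNode node;
    uint64_t offset = root_offset;

    for (;;) {
        if (pread(fd, &node, sizeof(node), offset) != sizeof(node)) {
            return false;                     /* I/O error */
        }
        if (node.nb_entries == 0) {
            return false;                     /* empty/corrupted node */
        }

        /* find the first entry whose key is >= hash */
        uint32_t i = 0;
        int cmp = -1;
        while (i < node.nb_entries &&
               (cmp = memcmp(hash, node.entries[i].key, HASH_LEN)) > 0) {
            i++;
        }

        if (node.is_leaf) {
            if (i < node.nb_entries && cmp == 0) {
                *cluster_offset = node.entries[i].value;
                return true;                  /* duplicated cluster found */
            }
            return false;                     /* not found: first write */
        }

        /* descend into the child covering this key range */
        if (i == node.nb_entries) {
            i--;
        }
        offset = node.entries[i].value;
    }
}

Under these assumptions a 4KB node holds about 100 entries, so roughly one million hashes (a 4GB image worth of 4KB clusters) fit in a three level tree and a cold lookup costs about 3 reads, minus whatever a node cache hides.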