Hi! This is very interesting! Could you please export a branch to look at, as patches can't be applied on master now :(
15.10.2019 18:23, Alberto Garcia wrote: > Hi, > > this series adds a new feature to the qcow2 on-disk format called > "Extended L2 Entries", which allows us to do subcluster allocation. > > This cover letter explains the reasons behind this proposal, the > changes to the on-disk format, test results and pending work. If you > are curious you can also have a look at previous discussions about > this feature: > > https://lists.gnu.org/archive/html/qemu-block/2017-04/msg00178.html > https://lists.gnu.org/archive/html/qemu-block/2019-06/msg01155.html > > This is the first proper version of the patches, and I believe that > the implementation is complete. However since I'm proposing a change > to the on-disk format I'm labeling this as RFC because I'm expecting > some debate. I'll remove the RFC tag and add more tests in future > revisions. > > === Problem === > > A qcow2 image is divided into units of constant size called clusters, > and among other things it contains metadata that maps guest addresses > to host addresses (the so-called L1 and L2 tables). > > There are two basic problems that result from this: > > 1) Reading from or writing to a qcow2 image involves reading the > corresponding entry on the L2 table that maps the guest address to > the host address. This is very slow because it involves two I/O > operations: one on the L2 table and the other one on the actual > data cluster. > > 2) A cluster is the smallest unit of allocation. Therefore writing a > mere 512 bytes to an empty disk requires allocating a complete > cluster and filling it with zeroes (or with data from the backing > image if there is one). This wastes more disk space and also has a > negative impact on I/O. > > Problem (1) can be solved by caching the L2 tables in memory. The > maximum amount of disk space used by L2 tables depends on the virtual > disk size and the cluster size: > > max_l2_size = virtual_disk_size * 8 / cluster_size > > Because of this, the only way to reduce the size of the L2 tables is > by increasing the cluster size (which can be any power of two between > 512 bytes and 2 MB). But then we hit problem (2): I/O is slower and > more disk space is wasted. > > === The proposal === > > The proposal is to extend the qcow2 format by allowing subcluster > allocation. The on-disk format remains essentially the same, except > that each data cluster is internally divided into 32 subclusters of > equal size. > > The way it works in practice is with a new optional feature called > "Extended L2 Entries", that needs to be enabled when an image is > created. With this, each entry on an L2 table is accompanied by a > bitmap indicating the allocation state of each one of the subclusters > for that cluster. The size of an L2 entry doubles from 64 to 128 bits. > > Other than L2 entries, all other data structures remain unchanged, but > for data clusters the smallest unit of allocation is now the > subcluster. Reference counting is still at the cluster level, because > there is no way to reference individual subclusters. Copy-on-write on > internal snapshots needs to copy complete clusters, so that scenario > would not benefit from this change. > > I see two main use cases for this feature: > > a) The qcow2 image is not too large / the L2 cache is not a problem, > but you want to increase the allocation performance. In this case > you can have a 128KB cluster with 4KB subclusters (with 4KB being a > common block size in ext4 and other filesystems) > > b) The qcow2 image is very large and you want to save metadata space > in order to have a smaller L2 cache. In this case you can go for > the maximum cluster size (2MB) but you want to have smaller > subclusters to increase the allocation performance and optimize the > disk usage. > > === Changes to the on-disk format === > > An L2 entry is 64 bits wide, with this format (for uncompressed > clusters): > > 63 56 55 48 47 40 39 32 31 24 23 16 15 8 7 0 > 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 > **<----> <--------------------------------------------------><------->* > Rsrved host cluster offset of data Reserved > (6 bits) (47 bits) (8 bits) > > bit 63: refcount == 1 (QCOW_OFLAG_COPIED) > bit 62: compressed = 1 (QCOW_OFLAG_COMPRESSED) > bit 0: all zeros (QCOW_OFLAG_ZERO) > > If Extended L2 Entries are enabled, bit 0 becomes reserved and must be > unset, and this 64-bit bitmap follows the entry: > > 63 56 55 48 47 40 39 32 31 24 23 16 15 8 7 0 > 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 > <---------------------------------> <---------------------------------> > subcluster reads as zeros subcluster is allocated > (32 bits) (32 bits) > > All this applies to uncompressed clusters. Compressed clusters are not > divided into subclusters, the cluster descriptor remains exactly the > same, and the 64-bit bitmap is not used (i.e. all bits are always 0). > > === Test results === > > I made all tests on an SSD drive, writing to an empty qcow2 image with > a fully populated 40GB backing image, performing random writes using > fio with a block size of 4KB. I ran the tests with all available > cluster sizes starting from 4KB. > > It's important to point out that once a cluster has been completely > allocated then having subclusters offers no performance benefit. For > this reason the size of the image for these tests (40GB) was chosen to > be large enough to guarantee that there are always new clusters being > allocated. This is therefore a worst-case scenario (or best-case for > this feature, if you want). > > Subcluster sizes are in brackets: > > |-----------------+----------------+-----------------| > | Cluster size | subclusters=on | subclusters=off | > |-----------------+----------------+-----------------| > | 4 KB ( N/A ) | N/A | 95 IOPS | > | 8 KB ( N/A ) | N/A | 599 IOPS | > | 16 KB (512 B) | 4129 IOPS | 3597 IOPS | > | 32 KB (1 KB) | 11255 IOPS | 2642 IOPS | > | 64 KB (2 KB) | 13341 IOPS | 1671 IOPS | > | 128 KB (4 KB) | 12391 IOPS | 870 IOPS | > | 256 KB (8 KB) | 9645 IOPS | 566 IOPS | > | 512 KB (16 KB) | 4960 IOPS | 359 IOPS | > | 1024 KB (32 KB) | 2732 IOPS | 215 IOPS | > | 2048 KB (64 KB) | 1630 IOPS | 214 IOPS | > |-----------------+----------------+-----------------| > > Here are the same tests, but without any backing image: > > |-----------------+----------------+-----------------| > | Cluster size | subclusters=on | subclusters=off | > |-----------------+----------------+-----------------| > | 4 KB ( N/A ) | N/A | 93 IOPS | > | 8 KB ( N/A ) | N/A | 539 IOPS | > | 16 KB (512 B) | 4174 IOPS | 7598 IOPS | > | 32 KB (1 KB) | 11326 IOPS | 11957 IOPS | > | 64 KB (2 KB) | 13516 IOPS | 13375 IOPS | > | 128 KB (4 KB) | 12435 IOPS | 13274 IOPS | > | 256 KB (8 KB) | 12071 IOPS | 14174 IOPS | > | 512 KB (16 KB) | 12169 IOPS | 14343 IOPS | > | 1024 KB (32 KB) | 12307 IOPS | 14622 IOPS | > | 2048 KB (64 KB) | 12784 IOPS | 14574 IOPS | > |-----------------+----------------+-----------------| > > Some comments about the results: > > - The smallest allowed cluster size for an image with subclusters is > 16 KB (in this case the subclusters size is 512 bytes), hence the > missing values in the 4 KB and 8 KB rows. > > - In images with a backing file: allocation is much faster when > subclusters are enabled. As expected, images with a cluster size of > 64KB perform similar to images with a subcluster size of 64KB. When > there is no copy-on-write involved (subcluster size <= 4KB) then the > maximum performance is achieved. > > - In images without a backing file: Since commit c8bb23cbdb when empty > clusters are allocated for the first time they are filled with > zeroes using an efficient method (typically fallocate() with > FALLOC_FL_ZERO_RANGE). This is so fast that having subclusters here > is actually a bit slower in most cases (although it still saves disk > space). > > - The 16 KB cluster / 512 byte subcluster case is quite slow. > I haven't debugged this but I suspect that this is because new > clusters need to be allocated all the time, and also L2 and refcount > tables are very small and need to grow all the time. The same pattern > can be seen in images without subclusters. > > === To do === > > A couple of things are missing from this series: > > - The ability to efficiently zero individual subclusters using > qcow2_co_pwrite_zeroes(). At the moment only full clusters can be > zeroed with this method. > > - Alternatively we could get rid of the individual "all zeroes" bits > altogether and have 64 subclusters per cluster. We would still have > the QCOW_OFLAG_ZERO bit in the standard cluster descriptor. > > - The number of subclusters per cluster is always 32. It would be > trivial to allow configuring this, but I don't see any use case. > > - Tests: I have a few written that I'll add in future revisions of > this series. > > - handle_alloc_space() works at the subclusters level. That is, if you > have an unallocated 2MB cluster with 64KB subclusters, no backing > image and you write 4KB of data, QEMU won't write zeroes to the > affected subcluster(s) and will use handle_alloc_space() instead. > The other subclusters won't be touched and will remain unallocated. > This behavior is consistent with how subclusters work and saves disk > space, but offers slightly lower performance (see test results > above). Theoretically we could offer a setting to configure this, > but I'm not convinced that this is very useful. > > =========================== > > As usual, feedback is welcome, > > Berto > > Alberto Garcia (23): > qcow2: Add calculate_l2_meta() > qcow2: Split cluster_needs_cow() out of count_cow_clusters() > qcow2: Process QCOW2_CLUSTER_ZERO_ALLOC clusters in handle_copied() > qcow2: Add get_l2_entry() and set_l2_entry() > qcow2: Document the Extended L2 Entries feature > qcow2: Add dummy has_subclusters() function > qcow2: Add subcluster-related fields to BDRVQcow2State > qcow2: Add offset_to_sc_index() > qcow2: Add l2_entry_size() > qcow2: Update get/set_l2_entry() and add get/set_l2_bitmap() > qcow2: Add qcow2_get_subcluster_type() > qcow2: Handle QCOW2_CLUSTER_UNALLOCATED_SUBCLUSTER > qcow2: Add subcluster support to calculate_l2_meta() > qcow2: Add subcluster support to qcow2_get_cluster_offset() > qcow2: Add subcluster support to zero_in_l2_slice() > qcow2: Add subcluster support to discard_in_l2_slice() > qcow2: Add subcluster support to check_refcounts_l2() > qcow2: Add subcluster support to expand_zero_clusters_in_l1() > qcow2: Fix offset calculation in handle_dependencies() > qcow2: Update L2 bitmap in qcow2_alloc_cluster_link_l2() > qcow2: Add subcluster support to handle_alloc_space() > qcow2: Restrict qcow2_co_pwrite_zeroes() to full clusters only > qcow2: Add the 'extended_l2' option and the QCOW2_INCOMPAT_EXTL2 bit > > block/qcow2-cluster.c | 547 ++++++++++++++++++++----------- > block/qcow2-refcount.c | 38 ++- > block/qcow2.c | 83 ++++- > block/qcow2.h | 121 ++++++- > docs/interop/qcow2.txt | 68 +++- > docs/qcow2-cache.txt | 19 +- > include/block/block_int.h | 1 + > qapi/block-core.json | 2 + > tests/qemu-iotests/031.out | 8 +- > tests/qemu-iotests/036.out | 4 +- > tests/qemu-iotests/049.out | 102 +++--- > tests/qemu-iotests/060.out | 1 + > tests/qemu-iotests/061.out | 20 +- > tests/qemu-iotests/065 | 18 +- > tests/qemu-iotests/082.out | 48 ++- > tests/qemu-iotests/085.out | 38 +-- > tests/qemu-iotests/144.out | 4 +- > tests/qemu-iotests/182.out | 2 +- > tests/qemu-iotests/185.out | 8 +- > tests/qemu-iotests/198.out | 2 + > tests/qemu-iotests/206.out | 4 + > tests/qemu-iotests/242.out | 5 + > tests/qemu-iotests/255.out | 8 +- > tests/qemu-iotests/common.filter | 1 + > 24 files changed, 817 insertions(+), 335 deletions(-) > -- Best regards, Vladimir