Re: [RFC PATCH 00/23] Add subcluster allocation to qcow2

Vladimir Sementsov-Ogievskiy Wed, 23 Oct 2019 04:09:41 -0700

Hi!

This is very interesting! Could you please export a branch to look at,
as patches can't be applied on master now :(


15.10.2019 18:23, Alberto Garcia wrote:
> Hi,
> 
> this series adds a new feature to the qcow2 on-disk format called
> "Extended L2 Entries", which allows us to do subcluster allocation.
> 
> This cover letter explains the reasons behind this proposal, the
> changes to the on-disk format, test results and pending work. If you
> are curious you can also have a look at previous discussions about
> this feature:
> 
>     https://lists.gnu.org/archive/html/qemu-block/2017-04/msg00178.html
>     https://lists.gnu.org/archive/html/qemu-block/2019-06/msg01155.html
> 
> This is the first proper version of the patches, and I believe that
> the implementation is complete. However since I'm proposing a change
> to the on-disk format I'm labeling this as RFC because I'm expecting
> some debate. I'll remove the RFC tag and add more tests in future
> revisions.
> 
> === Problem ===
> 
> A qcow2 image is divided into units of constant size called clusters,
> and among other things it contains metadata that maps guest addresses
> to host addresses (the so-called L1 and L2 tables).
> 
> There are two basic problems that result from this:
> 
> 1) Reading from or writing to a qcow2 image involves reading the
>     corresponding entry on the L2 table that maps the guest address to
>     the host address. This is very slow because it involves two I/O
>     operations: one on the L2 table and the other one on the actual
>     data cluster.
> 
> 2) A cluster is the smallest unit of allocation. Therefore writing a
>     mere 512 bytes to an empty disk requires allocating a complete
>     cluster and filling it with zeroes (or with data from the backing
>     image if there is one). This wastes more disk space and also has a
>     negative impact on I/O.
> 
> Problem (1) can be solved by caching the L2 tables in memory. The
> maximum amount of disk space used by L2 tables depends on the virtual
> disk size and the cluster size:
> 
>     max_l2_size = virtual_disk_size * 8 / cluster_size
> 
> Because of this, the only way to reduce the size of the L2 tables is
> by increasing the cluster size (which can be any power of two between
> 512 bytes and 2 MB). But then we hit problem (2): I/O is slower and
> more disk space is wasted.
> 
> === The proposal ===
> 
> The proposal is to extend the qcow2 format by allowing subcluster
> allocation. The on-disk format remains essentially the same, except
> that each data cluster is internally divided into 32 subclusters of
> equal size.
> 
> The way it works in practice is with a new optional feature called
> "Extended L2 Entries", that needs to be enabled when an image is
> created. With this, each entry on an L2 table is accompanied by a
> bitmap indicating the allocation state of each one of the subclusters
> for that cluster. The size of an L2 entry doubles from 64 to 128 bits.
> 
> Other than L2 entries, all other data structures remain unchanged, but
> for data clusters the smallest unit of allocation is now the
> subcluster. Reference counting is still at the cluster level, because
> there is no way to reference individual subclusters. Copy-on-write on
> internal snapshots needs to copy complete clusters, so that scenario
> would not benefit from this change.
> 
> I see two main use cases for this feature:
> 
> a) The qcow2 image is not too large / the L2 cache is not a problem,
>     but you want to increase the allocation performance. In this case
>     you can have a 128KB cluster with 4KB subclusters (with 4KB being a
>     common block size in ext4 and other filesystems)
> 
> b) The qcow2 image is very large and you want to save metadata space
>     in order to have a smaller L2 cache. In this case you can go for
>     the maximum cluster size (2MB) but you want to have smaller
>     subclusters to increase the allocation performance and optimize the
>     disk usage.
> 
> === Changes to the on-disk format ===
> 
> An L2 entry is 64 bits wide, with this format (for uncompressed
> clusters):
> 
> 63    56 55    48 47    40 39    32 31    24 23    16 15     8 7      0
> 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> **<----> <--------------------------------------------------><------->*
>    Rsrved              host cluster offset of data             Reserved
>    (6 bits)                (47 bits)                           (8 bits)
> 
>      bit 63: refcount == 1   (QCOW_OFLAG_COPIED)
>      bit 62: compressed = 1  (QCOW_OFLAG_COMPRESSED)
>      bit  0: all zeros       (QCOW_OFLAG_ZERO)
> 
> If Extended L2 Entries are enabled, bit 0 becomes reserved and must be
> unset, and this 64-bit bitmap follows the entry:
> 
> 63    56 55    48 47    40 39    32 31    24 23    16 15     8 7      0
> 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> <---------------------------------> <--------------------------------->
>       subcluster reads as zeros            subcluster is allocated
>               (32 bits)                           (32 bits)
> 
> All this applies to uncompressed clusters. Compressed clusters are not
> divided into subclusters, the cluster descriptor remains exactly the
> same, and the 64-bit bitmap is not used (i.e. all bits are always 0).
> 
> === Test results ===
> 
> I made all tests on an SSD drive, writing to an empty qcow2 image with
> a fully populated 40GB backing image, performing random writes using
> fio with a block size of 4KB. I ran the tests with all available
> cluster sizes starting from 4KB.
> 
> It's important to point out that once a cluster has been completely
> allocated then having subclusters offers no performance benefit. For
> this reason the size of the image for these tests (40GB) was chosen to
> be large enough to guarantee that there are always new clusters being
> allocated. This is therefore a worst-case scenario (or best-case for
> this feature, if you want).
> 
> Subcluster sizes are in brackets:
> 
> |-----------------+----------------+-----------------|
> |   Cluster size  | subclusters=on | subclusters=off |
> |-----------------+----------------+-----------------|
> |    4 KB ( N/A ) |            N/A |         95 IOPS |
> |    8 KB ( N/A ) |            N/A |        599 IOPS |
> |   16 KB (512 B) |      4129 IOPS |       3597 IOPS |
> |   32 KB  (1 KB) |     11255 IOPS |       2642 IOPS |
> |   64 KB  (2 KB) |     13341 IOPS |       1671 IOPS |
> |  128 KB  (4 KB) |     12391 IOPS |        870 IOPS |
> |  256 KB  (8 KB) |      9645 IOPS |        566 IOPS |
> |  512 KB (16 KB) |      4960 IOPS |        359 IOPS |
> | 1024 KB (32 KB) |      2732 IOPS |        215 IOPS |
> | 2048 KB (64 KB) |      1630 IOPS |        214 IOPS |
> |-----------------+----------------+-----------------|
> 
> Here are the same tests, but without any backing image:
> 
> |-----------------+----------------+-----------------|
> |   Cluster size  | subclusters=on | subclusters=off |
> |-----------------+----------------+-----------------|
> |    4 KB ( N/A ) |            N/A |         93 IOPS |
> |    8 KB ( N/A ) |            N/A |        539 IOPS |
> |   16 KB (512 B) |      4174 IOPS |       7598 IOPS |
> |   32 KB  (1 KB) |     11326 IOPS |      11957 IOPS |
> |   64 KB  (2 KB) |     13516 IOPS |      13375 IOPS |
> |  128 KB  (4 KB) |     12435 IOPS |      13274 IOPS |
> |  256 KB  (8 KB) |     12071 IOPS |      14174 IOPS |
> |  512 KB (16 KB) |     12169 IOPS |      14343 IOPS |
> | 1024 KB (32 KB) |     12307 IOPS |      14622 IOPS |
> | 2048 KB (64 KB) |     12784 IOPS |      14574 IOPS |
> |-----------------+----------------+-----------------|
> 
> Some comments about the results:
> 
> - The smallest allowed cluster size for an image with subclusters is
>    16 KB (in this case the subclusters size is 512 bytes), hence the
>    missing values in the 4 KB and 8 KB rows.
> 
> - In images with a backing file: allocation is much faster when
>    subclusters are enabled. As expected, images with a cluster size of
>    64KB perform similar to images with a subcluster size of 64KB. When
>    there is no copy-on-write involved (subcluster size <= 4KB) then the
>    maximum performance is achieved.
> 
> - In images without a backing file: Since commit c8bb23cbdb when empty
>    clusters are allocated for the first time they are filled with
>    zeroes using an efficient method (typically fallocate() with
>    FALLOC_FL_ZERO_RANGE). This is so fast that having subclusters here
>    is actually a bit slower in most cases (although it still saves disk
>    space).
> 
> - The 16 KB cluster / 512 byte subcluster case is quite slow.
>    I haven't debugged this but I suspect that this is because new
>    clusters need to be allocated all the time, and also L2 and refcount
>    tables are very small and need to grow all the time. The same pattern
>    can be seen in images without subclusters.
> 
> === To do ===
> 
> A couple of things are missing from this series:
> 
> - The ability to efficiently zero individual subclusters using
>    qcow2_co_pwrite_zeroes(). At the moment only full clusters can be
>    zeroed with this method.
> 
> - Alternatively we could get rid of the individual "all zeroes" bits
>    altogether and have 64 subclusters per cluster. We would still have
>    the QCOW_OFLAG_ZERO bit in the standard cluster descriptor.
> 
> - The number of subclusters per cluster is always 32. It would be
>    trivial to allow configuring this, but I don't see any use case.
> 
> - Tests: I have a few written that I'll add in future revisions of
>    this series.
> 
> - handle_alloc_space() works at the subclusters level. That is, if you
>    have an unallocated 2MB cluster with 64KB subclusters, no backing
>    image and you write 4KB of data, QEMU won't write zeroes to the
>    affected subcluster(s) and will use handle_alloc_space() instead.
>    The other subclusters won't be touched and will remain unallocated.
>    This behavior is consistent with how subclusters work and saves disk
>    space, but offers slightly lower performance (see test results
>    above). Theoretically we could offer a setting to configure this,
>    but I'm not convinced that this is very useful.
> 
> ===========================
> 
> As usual, feedback is welcome,
> 
> Berto
> 
> Alberto Garcia (23):
>    qcow2: Add calculate_l2_meta()
>    qcow2: Split cluster_needs_cow() out of count_cow_clusters()
>    qcow2: Process QCOW2_CLUSTER_ZERO_ALLOC clusters in handle_copied()
>    qcow2: Add get_l2_entry() and set_l2_entry()
>    qcow2: Document the Extended L2 Entries feature
>    qcow2: Add dummy has_subclusters() function
>    qcow2: Add subcluster-related fields to BDRVQcow2State
>    qcow2: Add offset_to_sc_index()
>    qcow2: Add l2_entry_size()
>    qcow2: Update get/set_l2_entry() and add get/set_l2_bitmap()
>    qcow2: Add qcow2_get_subcluster_type()
>    qcow2: Handle QCOW2_CLUSTER_UNALLOCATED_SUBCLUSTER
>    qcow2: Add subcluster support to calculate_l2_meta()
>    qcow2: Add subcluster support to qcow2_get_cluster_offset()
>    qcow2: Add subcluster support to zero_in_l2_slice()
>    qcow2: Add subcluster support to discard_in_l2_slice()
>    qcow2: Add subcluster support to check_refcounts_l2()
>    qcow2: Add subcluster support to expand_zero_clusters_in_l1()
>    qcow2: Fix offset calculation in handle_dependencies()
>    qcow2: Update L2 bitmap in qcow2_alloc_cluster_link_l2()
>    qcow2: Add subcluster support to handle_alloc_space()
>    qcow2: Restrict qcow2_co_pwrite_zeroes() to full clusters only
>    qcow2: Add the 'extended_l2' option and the QCOW2_INCOMPAT_EXTL2 bit
> 
>   block/qcow2-cluster.c            | 547 ++++++++++++++++++++-----------
>   block/qcow2-refcount.c           |  38 ++-
>   block/qcow2.c                    |  83 ++++-
>   block/qcow2.h                    | 121 ++++++-
>   docs/interop/qcow2.txt           |  68 +++-
>   docs/qcow2-cache.txt             |  19 +-
>   include/block/block_int.h        |   1 +
>   qapi/block-core.json             |   2 +
>   tests/qemu-iotests/031.out       |   8 +-
>   tests/qemu-iotests/036.out       |   4 +-
>   tests/qemu-iotests/049.out       | 102 +++---
>   tests/qemu-iotests/060.out       |   1 +
>   tests/qemu-iotests/061.out       |  20 +-
>   tests/qemu-iotests/065           |  18 +-
>   tests/qemu-iotests/082.out       |  48 ++-
>   tests/qemu-iotests/085.out       |  38 +--
>   tests/qemu-iotests/144.out       |   4 +-
>   tests/qemu-iotests/182.out       |   2 +-
>   tests/qemu-iotests/185.out       |   8 +-
>   tests/qemu-iotests/198.out       |   2 +
>   tests/qemu-iotests/206.out       |   4 +
>   tests/qemu-iotests/242.out       |   5 +
>   tests/qemu-iotests/255.out       |   8 +-
>   tests/qemu-iotests/common.filter |   1 +
>   24 files changed, 817 insertions(+), 335 deletions(-)
> 


-- 
Best regards,
Vladimir

Re: [RFC PATCH 00/23] Add subcluster allocation to qcow2

Reply via email to