As has been requested, this series adds new overlap check functions to the qcow2 code. My local branch is called "qcow2-improved-overlap-v1", but I am not so sure whether it is actually an improvement; that is left for you to decide, dear reviewers.
See patch 1 for an explanation of why this series exists and what it does. Patch 1 is basically the core of this series; the rest just employs the functions introduced there.

In a later patch, we may want to change the meaning of the "constant" overlap checking option to mean the same as "cached", which is everything except for inactive L2 tables. This series makes checking for overlaps with inactive L2 tables at runtime just as cheap as everything else (constant time plus caching), but using these checks means qemu has to read all the snapshot L1 tables when opening a qcow2 file. This does not take long, of course, but it does result in a bit of overhead, so I did not want to enable it by default. I think just enabling all overlap checks by default after this series should be fine and useful, though.


Benchmarks
==========

First, let me repeat the results I wrote as a reply to the cover letter of v1:

I had a 1 GB qcow2 image in tmpfs (sorry, I "only" have 4 GB of RAM in this laptop, and a 2 GB image would kill it) which I accessed from a guest inside qemu (Arch Linux, to be exact, because its image just dumps you into a console and that's all I need). I ran

$ for i in $(seq 1 42); do \
      dd if=/dev/zero of=/dev/vda bs=65536; \
  done

because it does not matter what data is written to the image; I just need to write to a lot of clusters fast to maximize pressure on the overlap check.

The results were 13.5/10.5/12.41 % of CPU usage (recorded using perf) in qcow2_check_metadata_overlap() with the current implementation, and 0.08/0.09 % with this series applied (0.08 % with overlap-check=cached, 0.09 % with overlap-check=all).

Now as a table, just to have one here (and because it is useful when skipping through this text):

  Current implementation:    12.14 (± 1.519) %   (n = 3)
  With this series applied:   0.08 %             (n = 1)
    overlap-check=all:        0.09 %             (n = 1)

Now I did some other benchmarks. The first of those is just running (on the host, obviously):

$ perf record -ag \
    qemu-img create -f qcow2 -o preallocation=metadata,cluster_size=512 \
    /tmp/test.qcow2 2G

I ran ten rounds with the current implementation, ten rounds with these patches applied, and five rounds with cache_size set to 0 in qcow2_create_empty_metadata_list() (which results in only one bitmap existing at any one time, and therefore a lot of conversions between the bitmap and the list representation).

The current implementation had a CPU usage (fraction of cycles) in qcow2_check_metadata_overlap() of 47.65 (± 4.24) %. There was one rather extreme outlier (36.31 %); with that removed, it is 48.91 (± 1.54) %.

With this series applied, the usage dropped to 0.149 (± 0.028) %. Additionally, qcow2_metadata_list_enter() took 0.046 (± 0.021) %; both together thus took 0.195 (± 0.040) %.

With cache_size set to 0, the usage was 0.130 (± 0.012) % in qcow2_check_metadata_overlap() and 0.056 (± 0.011) % in qcow2_metadata_list_enter(), 0.186 (± 0.021) % in total. It dropped, but stayed within range of the standard deviation. Thus, heavy conversion between the bitmap and the list representation (which occurs in normal use cases due to random accesses) should not be a problem at all.

As a table:

  Current implementation:    48.91 (± 1.537) %   (n = 9)
  With this series applied:   0.195 (± 0.040) %  (n = 10)
  Only one bitmap cached:     0.186 (± 0.021) %  (n = 5)

And as a really noticeable consequence: Before this series, creating the image like that took 16.624 s (15.97 s user). With this series, it takes 4.502 s (3.94 s user).
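As a purely illustrative aside (none of the following is actual QEMU code, and all names in it are made up): the difference the numbers above show is essentially the difference between scanning every known metadata range on each guest write and testing a few bits in a cached per-cluster (per-window) bitmap. The real functions involved are qcow2_check_metadata_overlap() in block/qcow2-refcount.c on the one hand and the new checks introduced in patch 1 on the other; the sketch below only illustrates the asymptotic difference.

/* Purely illustrative, not actual QEMU code: a linear scan over all
 * known metadata ranges on every guest write (roughly the situation
 * without this series) versus testing bits in a cached per-cluster
 * bitmap (roughly the hot path with this series applied). */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct Range {
    uint64_t offset;
    uint64_t length;
} Range;

/* Cost grows with the number of metadata structures in the image */
static bool overlaps_linear(const Range *meta, size_t nb_meta,
                            uint64_t offset, uint64_t length)
{
    for (size_t i = 0; i < nb_meta; i++) {
        if (offset < meta[i].offset + meta[i].length &&
            meta[i].offset < offset + length) {
            return true;
        }
    }
    return false;
}

/* Cost only grows with the size of the write itself */
static bool overlaps_bitmap(const uint8_t *metadata_bitmap,
                            uint64_t cluster_size,
                            uint64_t offset, uint64_t length)
{
    uint64_t first = offset / cluster_size;
    uint64_t last = (offset + length - 1) / cluster_size;

    for (uint64_t c = first; c <= last; c++) {
        if (metadata_bitmap[c / 8] & (1u << (c % 8))) {
            return true;
        }
    }
    return false;
}

int main(void)
{
    /* Two metadata clusters (say, header and one refblock) at the start
     * of a tiny image with 512 byte clusters */
    Range meta[] = { { 0, 512 }, { 512, 512 } };
    uint8_t bitmap[1] = { 0x03 };   /* the same, one bit per host cluster */

    /* A guest write landing at host offset 1024 overlaps neither */
    printf("linear: %d, bitmap: %d\n",
           overlaps_linear(meta, 2, 1024, 512),
           overlaps_bitmap(bitmap, 512, 1024, 512));
    return 0;
}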
Because statistics are fun, I collected the number of metadata list operations for the second and the third tests:

Second test (default of 15 bitmaps cached):
  List to bitmap conversions:    1053
  Bitmap to list conversions:    1038
  Bitmaps deallocated:           1038
  Insert operations:            82266
  Remove operations:               14
  Queries:                     312265

Third test (only one bitmap cached):
  List to bitmap conversions:  135943
  Bitmap to list conversions:   66546
  Bitmaps deallocated:         135942
  Insert operations:            82266
  Remove operations:               14
  Queries:                     312265

As expected, the number of conversions is much higher. Interestingly, in the first case, every bitmap deallocation also meant a bitmap to list conversion; that means every bitmap was modified at least once while it was cached. In contrast to that, in the second case, only every second deallocation resulted in a conversion, which means that half of the cached bitmaps were only read (which is bad considering that all that was done was allocating metadata structures). But for performance, there is no real information here (we only see that setting cache_size to 0 actually did increase the number of conversions).

The second benchmark I ran today was a program which opened /dev/vda in qemu (again on Arch) with O_DIRECT | O_SYNC | O_RDWR. It then wrote an uninitialized 512 byte buffer to random (512-byte-aligned) places in a 1 GB image freshly created before VM launch with metadata preallocation and 512 byte clusters. The intention was to break bitmap caching and force conversions back and forth between the list and the bitmap representation. The results are as follows:

  Current implementation:    18.73  (± 0.157) %  (n = 10)
  With this series applied:   0.068 (± 0.009) %  (n = 10)
  Only one bitmap cached:     0.062 (± 0.008) %  (n = 5)


Runtime conclusion
------------------

As can be seen from the benchmarks, runtime CPU usage by the metadata overlap checks is greatly decreased by this series (to 1/150 in the first benchmark, to 1/250 in the second, and to 1/275 in the third).


Where is the caveat?
--------------------

The problem with this series is obviously that all the metadata information is read when opening a qcow2 image. iotest 044 is a good test for this. I personally haven't seen a problem here, but I can't rule any out; I never noticed any waits when opening images.

Approaching this theoretically, it becomes clear that there shouldn't be any problems here: if using the default of overlap-check=cached, only information that is available anyway is used to build up the metadata list (the same information that is iterated through for every overlap check in the current implementation). Since building the list simply means setting a bitmask in a bitmap and then transforming that bitmap into a list (which the second benchmark has shown not to pose any performance issues; the sketch at the end of this section gives a rough idea), there should not be any problem here.

So, the only caveat I do see is increased code complexity. I initially thought it might be too complex for its purpose; but after having done the benchmarks, it became apparent to me that there is a problem with the current implementation and that this series does fix it.
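To give a rough idea of what "setting a bitmask in a bitmap and then transforming that bitmap into a list" amounts to, here is a minimal, purely illustrative sketch. The structure name and layout are made up for this example and are not the ones patch 1 actually uses (the real structure there is Qcow2MetadataFragment; its exact definition is in that patch):

/* Illustrative only: run-length encode one window's metadata bitmap
 * (one bit per host cluster) into a compact list of fragments. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct ExampleFragment {
    uint16_t relative_start;   /* first cluster of the run, relative to the window */
    uint16_t nb_clusters;      /* number of consecutive metadata clusters */
} ExampleFragment;             /* 4 bytes per entry, matching the estimate below */

/* Returns the number of fragments written to 'list' */
static size_t bitmap_to_list(const uint8_t *bitmap, unsigned window_clusters,
                             ExampleFragment *list)
{
    size_t n = 0;
    unsigned run_len = 0;

    for (unsigned c = 0; c < window_clusters; c++) {
        bool set = bitmap[c / 8] & (1u << (c % 8));

        if (set) {
            run_len++;
        } else if (run_len) {
            list[n].relative_start = c - run_len;
            list[n].nb_clusters = run_len;
            n++;
            run_len = 0;
        }
    }
    if (run_len) {
        list[n].relative_start = window_clusters - run_len;
        list[n].nb_clusters = run_len;
        n++;
    }
    return n;
}

int main(void)
{
    /* 16 clusters; clusters 0, 1 and 5 hold metadata */
    uint8_t bitmap[2] = { 0x23, 0x00 };
    ExampleFragment list[8];
    size_t n = bitmap_to_list(bitmap, 16, list);

    for (size_t i = 0; i < n; i++) {
        printf("fragment: start %u, %u cluster(s)\n",
               list[i].relative_start, list[i].nb_clusters);
    }
    return 0;
}

In this sketch, consecutive metadata clusters collapse into a single 4-byte entry, which is in line with the per-entry size assumed in the RAM estimate below; the real conversion in patch 1 may of course differ in detail.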
And RAM usage may pose a problem.


RAM usage
=========

So I have looked at my 2 GB image above, and the list uses 40 kB, which may or may not be too much (it sounds completely fine to me for an image with 512 byte clusters); but it is at least a number I can use to check the following theoretical inspection.

So, once again, we assume the worst: every metadata structure needs its own entry in the list - actually, the worst case would be that every cluster needs its own entry, but that only makes a difference for the L1 tables and the refcount table, so we can ignore it. In fact, we can completely ignore those tables; it makes things much easier.

Let's name the cluster size "CS" and the image size in bytes "IS".

For a refcount width of 16 bits per entry, one refblock can describe CS / 2 clusters; this means we need

  (IS / CS) / (CS / 2) = 2 * IS / (CS * CS)

refblocks.

Let's be generous and say that the L2 tables are capable of describing the complete image size (which in practice they do not, because the guest disk size is smaller than the physical size). This assumption also has the neat side effect that we do not have to care about snapshots: if we have a lot of internal snapshots with copied L2 tables, IS grows, and therefore the next formula knows that the number of L2 tables grows as well. Nice.

So, as every L2 table can describe CS / 8 (guest) clusters, we need

  8 * IS / (CS * CS)

L2 tables.

Therefore, ignoring the image header, the L1 tables, the reftable and the snapshot table, we have about the following number of metadata clusters per image (with 16 bit refcount entries):

  (2 + 8) * IS / (CS * CS) = 10 * IS / (CS * CS)

We said that for every metadata cluster, we need one entry in the list. Every list entry takes 32 bits (4 bytes). Thus, we need approximately up to

  40 * IS / (CS * CS)

bytes for the metadata list (please ignore units and just append "bytes" to the resulting number...).

Let's test that for the above image, which has a disk size of 266 MB:

  40 * 266M / (512 * 512) = 42 kB

Great! It works.

So, now let's set CS to 64 kB, because that is basically the only cluster size we really care about. For a 1 TB image, we need 10 kB for the list. Sounds great to me. For a 1 PB image, we will need 10 MB. Fair enough. (Note that you don't need 10 MB of RAM to create a 1 PB image; you only need that much once the disk size of the image has reached 1 PB.)

And 1 TB with 512 byte clusters? 160 MB. Urgh, that is a lot. But then again, you can switch off the overlap check with overlap-check=off; and trying to actually use a 1 TB image with 512 byte clusters is crazy in itself (have you tried just creating one without preallocation? It takes forever). So I can live with that.
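If you want to play with the estimate yourself, here is the same arithmetic as a small stand-alone C program; there is nothing qcow2-specific about it, it is just the worst-case formula derived above:

/* Worst-case metadata list size estimate from above:
 * roughly 40 * image_size / cluster_size^2 bytes. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t list_size_estimate(uint64_t image_size, uint64_t cluster_size)
{
    /* ~10 metadata clusters per cluster_size^2 bytes of image,
     * 4 bytes of list entry per metadata cluster */
    return 40 * image_size / (cluster_size * cluster_size);
}

int main(void)
{
    printf("1 TB, 64 kB clusters: %" PRIu64 " bytes\n",
           list_size_estimate(1ULL << 40, 64 * 1024));
    printf("1 PB, 64 kB clusters: %" PRIu64 " bytes\n",
           list_size_estimate(1ULL << 50, 64 * 1024));
    printf("1 TB, 512 B clusters: %" PRIu64 " bytes\n",
           list_size_estimate(1ULL << 40, 512));
    return 0;
}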
tl;dr
=====

* CPU usage at runtime decreased by a factor of 150 to 275 on overlap-check-heavy tasks
* No additional performance problems at loading time (in theory, building the list has the same runtime complexity as a single overlap check does right now; in practice, I could not find any problems)
* Decent RAM usage (10 kB for a 1 TB image with 64 kB clusters, 10 MB for a 1 PB image, etc. pp.)


As of this version, this series depends on "[PATCH v6] qcow2: Buffer L1 table in snapshot refcount update".


v2:
- Patch 1:
  - Fixed a mistake regarding the value of Qcow2MetadataFragment::relative_start; this value should be relative to the window start, not relative to the end of the last fragment (destroy_window_bitmap() wrote the latter there in v1)
  - Use uint64_t for the window index everywhere
- Patch 8: Said "qcow2: Buffer L1 table in snapshot refcount update" removes l1_allocated in qcow2_update_snapshot_refcount() and basically replaces it by active_l1 (where active_l1 = !l1_allocated). In v1, the condition it was used in was actually wrong; this is fixed here, too (the active L2 cluster type should be removed from the list if we are working on the active L1 table, not if we are working on an inactive L1 table).


git-backport-diff against v2:

Key:
[----] : patches are identical
[####] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/12:[0010] [FC] 'qcow2: Add new overlap check functions'
002/12:[----] [--] 'qcow2: Pull up overlap check option evaluation'
003/12:[----] [--] 'qcow2: Create metadata list'
004/12:[----] [--] 'qcow2/overlaps: Protect image header'
005/12:[----] [--] 'qcow2/overlaps: Protect refcount table'
006/12:[----] [--] 'qcow2/overlaps: Protect refcount blocks'
007/12:[----] [--] 'qcow2/overlaps: Protect active L1 table'
008/12:[0002] [FC] 'qcow2/overlaps: Protect active L2 tables'
009/12:[----] [--] 'qcow2/overlaps: Protect snapshot table'
010/12:[----] [--] 'qcow2/overlaps: Protect inactive L1 tables'
011/12:[----] [-C] 'qcow2/overlaps: Protect inactive L2 tables'
012/12:[----] [--] 'qcow2: Use new metadata overlap check function'


Max Reitz (12):
  qcow2: Add new overlap check functions
  qcow2: Pull up overlap check option evaluation
  qcow2: Create metadata list
  qcow2/overlaps: Protect image header
  qcow2/overlaps: Protect refcount table
  qcow2/overlaps: Protect refcount blocks
  qcow2/overlaps: Protect active L1 table
  qcow2/overlaps: Protect active L2 tables
  qcow2/overlaps: Protect snapshot table
  qcow2/overlaps: Protect inactive L1 tables
  qcow2/overlaps: Protect inactive L2 tables
  qcow2: Use new metadata overlap check function

 block/Makefile.objs    |   3 +-
 block/qcow2-cluster.c  |  13 ++
 block/qcow2-overlap.c  | 400 +++++++++++++++++++++++++++++++++++++++++++++++++
 block/qcow2-refcount.c | 202 ++++++++++---------------
 block/qcow2-snapshot.c | 105 ++++++++++++-
 block/qcow2.c          | 130 ++++++++++------
 block/qcow2.h          |  13 ++
 7 files changed, 694 insertions(+), 172 deletions(-)
 create mode 100644 block/qcow2-overlap.c

-- 
1.9.3