Based-on: <20211224065000.97572-1-pet...@redhat.com>

(Human-readable version: this patchset is based on
 https://lore.kernel.org/qemu-devel/20211224065000.97572-1-pet...@redhat.com/)
This series can also be found here:
  https://github.com/xzpeter/qemu/tree/postcopy-preempt

Abstract
========

This series adds a new migration capability called "postcopy-preempt".
It can be enabled when postcopy is enabled, and it simply (but greatly)
speeds up the handling of postcopy page requests.  (A sample QMP
sequence for enabling it is included after the test results below.)

Some quick tests below measure postcopy page request latency:

- Guest config: 20G guest, 40 vcpus
- Host config: 10Gbps host NIC attached between src/dst
- Workload: one busy dirty thread, writing to 18G of memory
  (pre-faulted).  (This refers to the "1 dirty thread" tests for
  2M/4K pages below.)
- Script: see [1]

|----------------+--------------+-----------------------|
| Host page size | Vanilla (ms) | Postcopy Preempt (ms) |
|----------------+--------------+-----------------------|
| 2M             |        10.58 |                  4.96 |
| 4K             |        10.68 |                  0.57 |
|----------------+--------------+-----------------------|

For 2M huge pages the average latency is roughly halved; for 4K pages
it is about 18 times lower.  For more information on the testing,
please refer to "Test Results" below.

Design
======

The postcopy-preempt feature contains two major reworks of postcopy
page fault handling:

(1) Postcopy page requests are now sent via a separate socket from the
    precopy background migration stream, so that they are isolated
    from the very high delays that the background stream can cause.

(2) For hosts with huge pages enabled: postcopy requests can now
    preempt the sending of a huge host page partway through on the
    src QEMU.

The design is relatively straightforward; however, there are quite a
few implementation details the patchset needs to address.  Many of
them are handled as separate patches, and the rest mostly in the big
patch that enables the whole feature.

Postcopy recovery is not yet supported; it will be added after the
approach has received some initial review.
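To make the idea a bit more concrete, below is a minimal C sketch of
what the source-side send path conceptually does with the feature
enabled.  It is illustrative only: postcopy_has_request() and
pss.postcopy_requested are borrowed from the patch titles, while the
types and the other helper names are made up for the sketch and do not
match the real code in migration/ram.c.

/*
 * Conceptual sketch, NOT the actual patch: send one host page
 * (possibly a huge page) one target page at a time, picking the
 * channel based on whether the destination is actively waiting on it,
 * and yielding a background huge page as soon as an urgent request
 * shows up.
 */
#include <stdbool.h>
#include <stddef.h>

#define TARGET_PAGE_SIZE 4096

typedef enum {
    CHANNEL_PRECOPY,         /* background migration stream */
    CHANNEL_POSTCOPY,        /* dedicated page-request stream */
} Channel;

/* Heavily simplified stand-in for QEMU's PageSearchStatus. */
typedef struct {
    size_t page;             /* next target page within the host page */
    bool postcopy_requested; /* did the destination fault on this page? */
    Channel channel;         /* which socket this page goes out on */
} PageSearchStatus;

/* Stub: in the series this checks the queue of urgent page requests. */
static bool postcopy_has_request(void)
{
    return false;
}

/* Stub: stands in for sending one target page on pss->channel. */
static int send_one_target_page(PageSearchStatus *pss)
{
    (void)pss;
    return 0;
}

static int save_host_page_sketch(PageSearchStatus *pss, size_t host_page_size)
{
    size_t pages = host_page_size / TARGET_PAGE_SIZE;

    /* Urgent (faulted) pages travel on their own channel. */
    pss->channel = pss->postcopy_requested ? CHANNEL_POSTCOPY
                                           : CHANNEL_PRECOPY;

    for (size_t i = 0; i < pages; i++) {
        int ret = send_one_target_page(pss);
        if (ret < 0) {
            return ret;
        }
        pss->page++;

        /*
         * Preemption point: let an urgent request cut in line instead
         * of making the faulted vCPU wait for the whole huge page.
         */
        if (pss->channel == CHANNEL_PRECOPY && postcopy_has_request()) {
            break;  /* the rest of this huge page is resumed later */
        }
    }
    return 0;
}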
Patch layout
============

The first 10 (out of 15) patches are mostly suitable for merging even
without the new feature, so they can be reviewed earlier.

Patches 11-14 implement the new feature: patches 11-13 are still small
preparation patches, while the major change is in patch 14.  Patch 15
is a unit test.

Test Results
============

I measured the page request latency by trapping userfaultfd kernel
faults with the bpf script [1].  KVM fast page faults are ignored,
because when one happens no major/real page fault is needed at all,
IOW, no query to the src QEMU.

The numbers (and histograms) below are based on whole postcopy
migration runs sampled with different configurations, from which the
average page request latency was calculated.  I also captured the
latency distributions, which are interesting to look at here as well.

One thing to mention is that I didn't test 1G pages at all.  That
doesn't mean this series won't help 1G - I believe it will help no
less than what is tested here - it's just that with 1G huge pages the
latency will be >1sec on a 10Gbps NIC, so it's not really a usable
scenario for any sensible customer.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2M huge page, 1 dirty thread
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With vanilla postcopy:

Average: 10582 (us)

@delay_us:
[1K, 2K)         7 |                                                    |
[2K, 4K)         1 |                                                    |
[4K, 8K)         9 |                                                    |
[8K, 16K)     1983 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

With postcopy-preempt:

Average: 4960 (us)

@delay_us:
[1K, 2K)         5 |                                                    |
[2K, 4K)        44 |                                                    |
[4K, 8K)      3495 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K)      154 |@@                                                  |
[16K, 32K)       1 |                                                    |

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4K small page, 1 dirty thread
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With vanilla postcopy:

Average: 10676 (us)

@delay_us:
[4, 8)           1 |                                                    |
[8, 16)          3 |                                                    |
[16, 32)         5 |                                                    |
[32, 64)         3 |                                                    |
[64, 128)       12 |                                                    |
[128, 256)      10 |                                                    |
[256, 512)      27 |                                                    |
[512, 1K)        5 |                                                    |
[1K, 2K)        11 |                                                    |
[2K, 4K)        17 |                                                    |
[4K, 8K)        10 |                                                    |
[8K, 16K)     2681 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K)       6 |                                                    |

With postcopy-preempt:

Average: 570 (us)

@delay_us:
[16, 32)         5 |                                                    |
[32, 64)         6 |                                                    |
[64, 128)     8340 |@@@@@@@@@@@@@@@@@@                                  |
[128, 256)   23052 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256, 512)    8119 |@@@@@@@@@@@@@@@@@@                                  |
[512, 1K)      148 |                                                    |
[1K, 2K)       759 |@                                                   |
[2K, 4K)      6729 |@@@@@@@@@@@@@@@                                     |
[4K, 8K)        80 |                                                    |
[8K, 16K)      115 |                                                    |
[16K, 32K)      32 |                                                    |

One funny thing about 4K small pages is that with vanilla postcopy I
didn't even get a speedup compared to 2M pages, probably because the
major overhead is not sending the page itself but everything else
(e.g. waiting for precopy to flush the existing pages).

The other thing is that in the postcopy-preempt test I can still see a
bunch of page requests with 2ms-4ms latency.  That's probably what we
would like to dig into next.  One possibility is that since we share
the same sending thread on the src QEMU, we could have yielded
because the precopy socket was full.  But that's TBD.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4K small page, 16 dirty threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

What I tested in addition was 16 concurrent faulting threads, in which
case the postcopy request queue can get relatively longer.  It's done
via:

  $ stress -m 16 --vm-bytes 1073741824 --vm-keep

With vanilla postcopy:

Average: 2244 (us)

@delay_us:
[0]            556 |                                                    |
[1]          11251 |@@@@@@@@@@@@                                        |
[2, 4)       12094 |@@@@@@@@@@@@@                                       |
[4, 8)       12234 |@@@@@@@@@@@@@                                       |
[8, 16)      47144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32)     42281 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
[32, 64)     17676 |@@@@@@@@@@@@@@@@@@@                                 |
[64, 128)      952 |@                                                   |
[128, 256)     405 |                                                    |
[256, 512)     779 |                                                    |
[512, 1K)     1003 |@                                                   |
[1K, 2K)      1976 |@@                                                  |
[2K, 4K)      4865 |@@@@@                                               |
[4K, 8K)      5892 |@@@@@@                                              |
[8K, 16K)    26941 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       |
[16K, 32K)     844 |                                                    |
[32K, 64K)      17 |                                                    |

With postcopy-preempt:

Average: 1064 (us)

@delay_us:
[0]           1341 |                                                    |
[1]          30211 |@@@@@@@@@@@@                                        |
[2, 4)       32934 |@@@@@@@@@@@@@                                       |
[4, 8)       21295 |@@@@@@@@                                            |
[8, 16)     130774 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32)     95128 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
[32, 64)     49591 |@@@@@@@@@@@@@@@@@@@                                 |
[64, 128)     3921 |@                                                   |
[128, 256)    1066 |                                                    |
[256, 512)    2730 |@                                                   |
[512, 1K)     1849 |                                                    |
[1K, 2K)       512 |                                                    |
[2K, 4K)      2355 |                                                    |
[4K, 8K)     48812 |@@@@@@@@@@@@@@@@@@@                                 |
[8K, 16K)    10026 |@@@                                                 |
[16K, 32K)     810 |                                                    |
[32K, 64K)      68 |                                                    |

In this specific case, the funny thing is that when there are tons of
postcopy requests, vanilla postcopy handles page requests even faster
(2ms average) than with only 1 dirty thread.  That's probably because
unqueue_page() will always hit anyway, so the precopy stream has less
of an effect on postcopy.  However, that is still slower than having a
standalone postcopy stream, as the preempt version does (1ms).
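As a side note for anyone who wants to try the series out: a rough QMP
sequence for exercising the feature might look like the below.  This
assumes the capability is exposed under the QAPI name
"postcopy-preempt" (patch 13) and that, like postcopy-ram, it needs to
be enabled on both sides before migration starts; host and port are
placeholders.

  # On both source and destination QMP monitors (the destination is
  # started with -incoming as usual):
  { "execute": "migrate-set-capabilities",
    "arguments": { "capabilities": [
        { "capability": "postcopy-ram",     "state": true },
        { "capability": "postcopy-preempt", "state": true } ] } }

  # On the source, start migration, then switch to postcopy:
  { "execute": "migrate", "arguments": { "uri": "tcp:HOST:PORT" } }
  { "execute": "migrate-start-postcopy" }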
Any comments are welcome.

[1] https://github.com/xzpeter/small-stuffs/blob/master/tools/huge_vm/uffd-latency.bpf

Peter Xu (15):
  migration: No off-by-one for pss->page update in host page size
  migration: Allow pss->page jump over clean pages
  migration: Enable UFFD_FEATURE_THREAD_ID even without blocktime feat
  migration: Add postcopy_has_request()
  migration: Simplify unqueue_page()
  migration: Move temp page setup and cleanup into separate functions
  migration: Introduce postcopy channels on dest node
  migration: Dump ramblock and offset too when non-same-page detected
  migration: Add postcopy_thread_create()
  migration: Move static var in ram_block_from_stream() into global
  migration: Add pss.postcopy_requested status
  migration: Move migrate_allow_multifd and helpers into migration.c
  migration: Add postcopy-preempt capability
  migration: Postcopy preemption on separate channel
  tests: Add postcopy preempt test

 migration/migration.c        | 107 +++++++--
 migration/migration.h        |  55 ++++-
 migration/multifd.c          |  19 +-
 migration/multifd.h          |   2 -
 migration/postcopy-ram.c     | 192 ++++++++++++----
 migration/postcopy-ram.h     |  14 ++
 migration/ram.c              | 417 ++++++++++++++++++++++++++++-------
 migration/ram.h              |   2 +
 migration/savevm.c           |  12 +-
 migration/socket.c           |  18 ++
 migration/socket.h           |   1 +
 migration/trace-events       |  12 +-
 qapi/migration.json          |   8 +-
 tests/qtest/migration-test.c |  21 ++
 14 files changed, 716 insertions(+), 164 deletions(-)

-- 
2.32.0