This is an RFC series.  Tree is here:

  https://github.com/xzpeter/qemu/tree/preempt-full
It's not complete yet, because there are still things to do, listed at the
end of this cover letter.  However, the series can already safely pass
qtest and all of my tests.

Compared to the recently merged preempt mode, I call this one
"preempt-full" because it threadifies the postcopy channels, so urgent
pages can now be handled fully outside of the ram save loop.  Sorry to
share a name with PREEMPT_FULL in the Linux RT world; it's just that we
needed a name for the capability and it was already named "preempt"
anyway.

The existing preempt code (already landed) reduced random page request
latency over a 10Gbps network from ~12ms to ~500us.  This preempt-full
series further reduces that ~500us to ~230us in my initial tests.  More to
share below.

Note that no new capability is needed; IOW, it's fully compatible with the
existing preempt mode, so the naming is not really important other than to
identify the difference between binaries.  That's because this series only
reworks the sender side and does not change the migration protocol; it
just runs faster.  IOW, an old "preempt" QEMU can migrate to a
"preempt-full" QEMU, and vice versa:

- When an old "preempt" QEMU migrates to a "preempt-full" QEMU, it behaves
  the same as running two old "preempt" QEMUs.

- When a "preempt-full" QEMU migrates to an old "preempt" QEMU, it behaves
  the same as running two "preempt-full" QEMUs.

The logic of the series is quite simple too: simply move the existing
preempt channel page sends into the rp-return thread.  It can slow down
the rp-return thread when receiving page requests, but I don't see a major
issue with that so far.

This latency number is getting close to the extreme of the 4K page request
latency of any TCP roundtrip on the 10Gbps nic I have.  The "extreme
number" is something I get from the mig_mon tool, which has a mode [1] to
emulate the extreme tcp roundtrips of page requests.
Performance
===========

Page request latencies have the distributions below, with a VM of 20G mem,
20 cores, a 10Gbps nic, and 18G of fully random writes:

Postcopy Vanilla
----------------

Average: 12093 (us)

@delay_us:
[1]                1 |                                                    |
[2, 4)             0 |                                                    |
[4, 8)             0 |                                                    |
[8, 16)            0 |                                                    |
[16, 32)           1 |                                                    |
[32, 64)           8 |                                                    |
[64, 128)         11 |                                                    |
[128, 256)        14 |                                                    |
[256, 512)        19 |                                                    |
[512, 1K)         14 |                                                    |
[1K, 2K)          35 |                                                    |
[2K, 4K)          18 |                                                    |
[4K, 8K)          87 |@                                                   |
[8K, 16K)       2397 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K)         7 |                                                    |
[32K, 64K)         2 |                                                    |
[64K, 128K)       20 |                                                    |
[128K, 256K)       6 |                                                    |

Postcopy Preempt
----------------

Average: 496 (us)

@delay_us:
[32, 64)           2 |                                                    |
[64, 128)       2306 |@@@@                                                |
[128, 256)     25422 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256, 512)      8238 |@@@@@@@@@@@@@@@@                                    |
[512, 1K)       1066 |@@                                                  |
[1K, 2K)        2167 |@@@@                                                |
[2K, 4K)        3329 |@@@@@@                                              |
[4K, 8K)         109 |                                                    |
[8K, 16K)         48 |                                                    |

Postcopy Preempt-Full
---------------------

Average: 229 (us)

@delay_us:
[8, 16)            1 |                                                    |
[16, 32)           3 |                                                    |
[32, 64)           2 |                                                    |
[64, 128)      11956 |@@@@@@@@@@                                          |
[128, 256)     60403 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256, 512)     15047 |@@@@@@@@@@@@                                        |
[512, 1K)        846 |                                                    |
[1K, 2K)          25 |                                                    |
[2K, 4K)          41 |                                                    |
[4K, 8K)         131 |                                                    |
[8K, 16K)         72 |                                                    |
[16K, 32K)         2 |                                                    |
[32K, 64K)         8 |                                                    |
[64K, 128K)        6 |                                                    |

For fully sequential page access workloads, I mentioned in the previous
preempt-mode work that such workloads may not benefit much from preempt
mode.  But surprisingly, at least in my sequential write test, the
preempt-full mode also benefited sequential access patterns when I
measured it:

Postcopy Vanilla
----------------

Average: 1487 (us)

@delay_us:
[0]               93 |@                                                   |
[1]             1920 |@@@@@@@@@@@@@@@@@@@@@@@                             |
[2, 4)           504 |@@@@@@                                              |
[4, 8)          2234 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
[8, 16)         4199 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32)        3782 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     |
[32, 64)        1016 |@@@@@@@@@@@@                                        |
[64, 128)         81 |@                                                   |
[128, 256)        14 |                                                    |
[256, 512)        26 |                                                    |
[512, 1K)         69 |                                                    |
[1K, 2K)         208 |@@                                                  |
[2K, 4K)         429 |@@@@@                                               |
[4K, 8K)        2779 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |
[8K, 16K)        792 |@@@@@@@@@                                           |
[16K, 32K)         9 |                                                    |

Postcopy Preempt
----------------

Average: 1582 (us)

@delay_us:
[0]               45 |                                                    |
[1]             1786 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       |
[2, 4)           423 |@@@@@@@                                             |
[4, 8)          1903 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                    |
[8, 16)         2933 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   |
[16, 32)        3132 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32, 64)         518 |@@@@@@@@                                            |
[64, 128)         30 |                                                    |
[128, 256)       218 |@@@                                                 |
[256, 512)       214 |@@@                                                 |
[512, 1K)        211 |@@@                                                 |
[1K, 2K)         131 |@@                                                  |
[2K, 4K)         336 |@@@@@                                               |
[4K, 8K)        3023 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[8K, 16K)        479 |@@@@@@@                                             |

Postcopy Preempt-Full
---------------------

Average: 439 (us)

@delay_us:
[0]                3 |                                                    |
[1]             1058 |@                                                   |
[2, 4)           179 |                                                    |
[4, 8)          1079 |@                                                   |
[8, 16)         2251 |@@@                                                 |
[16, 32)        2345 |@@@@                                                |
[32, 64)         713 |@                                                   |
[64, 128)       5386 |@@@@@@@@@                                           |
[128, 256)     30252 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256, 512)     10789 |@@@@@@@@@@@@@@@@@@                                  |
[512, 1K)        367 |                                                    |
[1K, 2K)          26 |                                                    |
[2K, 4K)         256 |                                                    |
[4K, 8K)        1840 |@@@                                                 |
[8K, 16K)        300 |                                                    |

I don't think sequential access matters much in migrations, because for
any non-small VM that has a migration challenge, multiple sequential
accesses will grow into a random access pattern anyway.  But I'm laying
out the data for reference regardless.

Comments welcome, thanks.

TODO List
=========

- Make migration accountings atomic
- Drop rs->f?
- Disable xbzrle for preempt mode?  Or is it perhaps already disabled for
  postcopy?
- If this series is accepted, we can logically drop some of the old
  (complicated) code from the old preempt series.
- Drop the x-postcopy-preempt-break-huge parameter?
- More to come

[1] https://github.com/xzpeter/mig_mon#vm-live-migration-network-emulator

Peter Xu (13):
  migration: Use non-atomic ops for clear log bitmap
  migration: Add postcopy_preempt_active()
  migration: Yield bitmap_mutex properly when sending/sleeping
  migration: Cleanup xbzrle zero page cache update logic
  migration: Disallow postcopy preempt to be used with compress
  migration: Trivial cleanup save_page_header() on same block check
  migration: Remove RAMState.f references in compression code
  migration: Teach PSS about host page
  migration: Introduce pss_channel
  migration: Add pss_init()
  migration: Make PageSearchStatus part of RAMState
  migration: Move last_sent_block into PageSearchStatus
  migration: Send requested page directly in rp-return thread

 include/exec/ram_addr.h |  11 +-
 include/qemu/bitmap.h   |   1 +
 migration/migration.c   |  11 +
 migration/ram.c         | 496 +++++++++++++++++++++++++++++-----------
 util/bitmap.c           |  45 ++++
 5 files changed, 421 insertions(+), 143 deletions(-)

-- 
2.32.0