On Mon, Aug 29, 2022 at 12:56:46PM -0400, Peter Xu wrote:
> This is an RFC series.  Tree is here:
>
>   https://github.com/xzpeter/qemu/tree/preempt-full
>
> It's not complete because there are still things to do (they're listed at
> the end of this cover letter), but the series already safely passes qtest
> and all of my tests.
>
> Compared to the recently merged preempt mode, I call this one
> "preempt-full" because it threadifies the postcopy channels, so that
> urgent pages can now be handled fully separately, outside of the ram save
> loop.  Sorry to share a name with PREEMPT_FULL in the Linux RT world;
> it's just that the capability needed a name and it was already named
> "preempt" anyway..
>
> The existing preempt code, which has already landed, reduced random page
> request latency over a 10Gbps network from ~12ms to ~500us.
>
> This preempt-full series further reduces that ~500us to ~230us in my
> initial tests.  More numbers to share below.
>
> Note that no new capability is needed, IOW it's fully compatible with the
> existing preempt mode.  So the naming is not really important; it only
> identifies the difference between the binaries.  That's because this
> series only reworks the sender side code and does not change the
> migration protocol; it just runs faster.
>
> IOW, an old "preempt" QEMU can migrate to a "preempt-full" QEMU, and vice
> versa:
>
> - When an old "preempt" QEMU migrates to a "preempt-full" QEMU, it'll
>   behave the same as running two old "preempt" QEMUs.
>
> - When a "preempt-full" QEMU migrates to an old "preempt" QEMU, it'll
>   behave the same as running two "preempt-full" QEMUs.
>
> The logic of the series is quite simple too: simply move the existing
> preempt channel page sends into the rp-return thread.  That can slow the
> rp-return thread down when it is receiving pages, but I don't really see
> a major issue with it so far.
>
> This latency number is getting close to the extreme 4K page request
> latency of any TCP roundtrip on the 10Gbps NIC I have.  The 'extreme
> number' is something I got from the mig_mon tool, which has a mode [1]
> that emulates the extreme TCP roundtrips of page requests.
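(Interjecting with an illustration of the 'extreme number' above: the
emulation is conceptually just a TCP ping-pong of one 4K page per request.
Below is a minimal standalone sketch of that idea only -- it is not
mig_mon's actual code, and the peer address/port are made up; it assumes a
peer is already listening that answers each 1-byte request with one 4K
buffer.)

    /*
     * Sketch: time one emulated page-request roundtrip -- send a 1-byte
     * request, wait for a full 4K page back, print the elapsed time.
     */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <time.h>
    #include <unistd.h>

    #define PAGE_SIZE 4096

    static uint64_t now_us(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
    }

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = {
            .sin_family = AF_INET,
            .sin_port = htons(8765),               /* made-up port */
        };
        inet_pton(AF_INET, "192.168.1.1", &addr.sin_addr); /* made-up peer */
        if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("connect");
            return 1;
        }

        char req = 'R', page[PAGE_SIZE];
        uint64_t start = now_us();
        if (write(fd, &req, 1) != 1) {             /* "send me one page" */
            return 1;
        }
        for (size_t got = 0; got < PAGE_SIZE; ) {  /* read the full 4K back */
            ssize_t n = read(fd, page + got, PAGE_SIZE - got);
            if (n <= 0) {
                return 1;
            }
            got += (size_t)n;
        }
        printf("roundtrip: %llu us\n", (unsigned long long)(now_us() - start));
        close(fd);
        return 0;
    }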
> Performance
> ===========
>
> Page request latencies have the distributions below, with a VM of 20G
> mem, 20 cores, a 10Gbps nic, and 18G of fully random writes:
>
> Postcopy Vanilla
> ----------------
>
> Average: 12093 (us)
>
> @delay_us:
> [1]                1 |                                                    |
> [2, 4)             0 |                                                    |
> [4, 8)             0 |                                                    |
> [8, 16)            0 |                                                    |
> [16, 32)           1 |                                                    |
> [32, 64)           8 |                                                    |
> [64, 128)         11 |                                                    |
> [128, 256)        14 |                                                    |
> [256, 512)        19 |                                                    |
> [512, 1K)         14 |                                                    |
> [1K, 2K)          35 |                                                    |
> [2K, 4K)          18 |                                                    |
> [4K, 8K)          87 |@                                                   |
> [8K, 16K)       2397 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [16K, 32K)         7 |                                                    |
> [32K, 64K)         2 |                                                    |
> [64K, 128K)       20 |                                                    |
> [128K, 256K)       6 |                                                    |
>
> Postcopy Preempt
> ----------------
>
> Average: 496 (us)
>
> @delay_us:
> [32, 64)           2 |                                                    |
> [64, 128)       2306 |@@@@                                                |
> [128, 256)     25422 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [256, 512)      8238 |@@@@@@@@@@@@@@@@                                    |
> [512, 1K)       1066 |@@                                                  |
> [1K, 2K)        2167 |@@@@                                                |
> [2K, 4K)        3329 |@@@@@@                                              |
> [4K, 8K)         109 |                                                    |
> [8K, 16K)         48 |                                                    |
>
> Postcopy Preempt-Full
> ---------------------
>
> Average: 229 (us)
>
> @delay_us:
> [8, 16)            1 |                                                    |
> [16, 32)           3 |                                                    |
> [32, 64)           2 |                                                    |
> [64, 128)      11956 |@@@@@@@@@@                                          |
> [128, 256)     60403 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [256, 512)     15047 |@@@@@@@@@@@@                                        |
> [512, 1K)        846 |                                                    |
> [1K, 2K)          25 |                                                    |
> [2K, 4K)          41 |                                                    |
> [4K, 8K)         131 |                                                    |
> [8K, 16K)         72 |                                                    |
> [16K, 32K)         2 |                                                    |
> [32K, 64K)         8 |                                                    |
> [64K, 128K)        6 |                                                    |
>
> For fully sequential page access workloads, I mentioned in the previous
> preempt-mode work that such workloads may not benefit much from preempt
> mode.  But surprisingly, at least in my seq write test, preempt-full mode
> also benefited sequential access patterns when I measured it:
>
> Postcopy Vanilla
> ----------------
>
> Average: 1487 (us)
>
> @delay_us:
> [0]               93 |@                                                   |
> [1]             1920 |@@@@@@@@@@@@@@@@@@@@@@@                             |
> [2, 4)           504 |@@@@@@                                              |
> [4, 8)          2234 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                         |
> [8, 16)         4199 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [16, 32)        3782 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
> [32, 64)        1016 |@@@@@@@@@@@@                                        |
> [64, 128)         81 |@                                                   |
> [128, 256)        14 |                                                    |
> [256, 512)        26 |                                                    |
> [512, 1K)         69 |                                                    |
> [1K, 2K)         208 |@@                                                  |
> [2K, 4K)         429 |@@@@@                                               |
> [4K, 8K)        2779 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |
> [8K, 16K)        792 |@@@@@@@@@                                           |
> [16K, 32K)         9 |                                                    |
>
> Postcopy Preempt-Full
> ---------------------
>
> Average: 1582 (us)
>
> @delay_us:
> [0]               45 |                                                    |
> [1]             1786 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       |
> [2, 4)           423 |@@@@@@@                                             |
> [4, 8)          1903 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                     |
> [8, 16)         2933 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    |
> [16, 32)        3132 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [32, 64)         518 |@@@@@@@@                                            |
> [64, 128)         30 |                                                    |
> [128, 256)       218 |@@@                                                 |
> [256, 512)       214 |@@@                                                 |
> [512, 1K)        211 |@@@                                                 |
> [1K, 2K)         131 |@@                                                  |
> [2K, 4K)         336 |@@@@@                                               |
> [4K, 8K)        3023 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |
> [8K, 16K)        479 |@@@@@@@                                             |
>
> Postcopy Preempt-Full
> ---------------------
>
> Average: 439 (us)
>
> @delay_us:
> [0]                3 |                                                    |
> [1]             1058 |@                                                   |
> [2, 4)           179 |                                                    |
> [4, 8)          1079 |@                                                   |
> [8, 16)         2251 |@@@                                                 |
> [16, 32)        2345 |@@@@                                                |
> [32, 64)         713 |@                                                   |
> [64, 128)       5386 |@@@@@@@@@                                           |
> [128, 256)     30252 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [256, 512)     10789 |@@@@@@@@@@@@@@@@@@                                  |
> [512, 1K)        367 |                                                    |
> [1K, 2K)          26 |                                                    |
> [2K, 4K)         256 |                                                    |
> [4K, 8K)        1840 |@@@                                                 |
> [8K, 16K)        300 |                                                    |
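(A note on reading the histograms above: each @delay_us row is a
power-of-two latency bucket in microseconds, so e.g. "[8K, 16K) 2397" in
the vanilla case means 2397 page requests took roughly 8-16ms each.  A
tiny standalone sketch of the bucketing convention I assume the tracer
uses -- illustration only, not the tracer's code:)

    /* Map a latency sample to its log2 bucket, like the rows above. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long samples[] = { 496, 12093, 229 };  /* example values */
        for (int i = 0; i < 3; i++) {
            unsigned long us = samples[i];
            int b = 0;
            while ((1UL << (b + 1)) <= us) {
                b++;
            }
            printf("%8lu us -> [%lu, %lu)\n", us, 1UL << b, 1UL << (b + 1));
        }
        return 0;   /* 12093us prints as [8192, 16384), i.e. [8K, 16K) */
    }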
> I've never considered seq access important for migrations, because for
> any not-small VM that poses a migration challenge, multiple seq accesses
> will grow into a random access pattern anyway.  But I'm laying the data
> out here for reference.
>
> Comments welcome, thanks.
>
> TODO List
> =========
>
> - Make migration accountings atomic
> - Drop rs->f?
> - Disable xbzrle for preempt mode?  Or is it perhaps already disabled for
>   postcopy?
> - If this series can really be accepted, we can logically drop some of
>   the old (complicated) code from the old preempt series.
> - Drop the x-postcopy-preempt-break-huge parameter?
> - More to come
>
> [1] https://github.com/xzpeter/mig_mon#vm-live-migration-network-emulator
>
> Peter Xu (13):
>   migration: Use non-atomic ops for clear log bitmap
>   migration: Add postcopy_preempt_active()
>   migration: Yield bitmap_mutex properly when sending/sleeping
>   migration: Cleanup xbzrle zero page cache update logic
>   migration: Disallow postcopy preempt to be used with compress
>   migration: Trivial cleanup save_page_header() on same block check
>   migration: Remove RAMState.f references in compression code
>   migration: Teach PSS about host page
>   migration: Introduce pss_channel
>   migration: Add pss_init()
>   migration: Make PageSearchStatus part of RAMState
>   migration: Move last_sent_block into PageSearchStatus
>   migration: Send requested page directly in rp-return thread
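To make the last patch concrete: today the rp-return thread only queues a
requested page and lets the migration thread send it; with preempt-full it
serves the page itself, over the preempt channel.  Roughly like below -- a
simplified sketch, not the actual diff; the helper names follow the patch
titles, but the pss_init() signature and the surrounding structure are my
shorthand, with locking and error paths trimmed:

    /*
     * Sketch: what the rp-return thread does with a MIG_RP_MSG_REQ_PAGES
     * request once postcopy preempt is active.
     */
    static int postcopy_request_page(RAMState *rs, RAMBlock *rb,
                                     ram_addr_t start, ram_addr_t len)
    {
        if (postcopy_preempt_active()) {
            /* The postcopy channel now owns its own PageSearchStatus */
            PageSearchStatus *pss = &rs->pss[RAM_CHANNEL_POSTCOPY];

            pss_init(pss, rb, start >> TARGET_PAGE_BITS);
            pss->postcopy_requested = true;

            /* Send the whole host page containing 'start', right here */
            return ram_save_host_page_urgent(pss);
        }

        /* Old behavior: queue it and let the ram save loop pick it up */
        return ram_save_queue_pages(rb->idstr, start, len);
    }

This is also why the rp-return thread can slow down on receiving page
requests, as mentioned above: the send is now synchronous in that thread.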
Side note: not all the patches here serve the preempt-full goal.  E.g., we
could consider reviewing/merging patches 1 and 5 earlier: patch 1 is a
long-standing perf improvement on the clear bitmap ops, and patch 5 should
be seen as a fix, I think.  Some of the other trivial cleanup patches
could be picked up too, but they're not urgent.

--
Peter Xu