This is an RFC series.  Tree is here:

  https://github.com/xzpeter/qemu/tree/preempt-full
It's not complete yet, because there are still things to do, listed at the
end of this cover letter.  However, the series can already safely pass
qtest and all of my tests.

Compared to the recently merged preempt mode, I call this one
"preempt-full" because it threadifies the postcopy channels, so urgent
pages can now be handled fully outside of the ram save loop.  Sorry to
share a name with PREEMPT_FULL in the Linux RT world; it's just that we
needed a name for the capability and it was already named "preempt"
anyway.

The existing preempt code (already landed) reduced random page request
latency over a 10Gbps network from ~12ms to ~500us.  This preempt-full
series further reduces that ~500us to ~230us in my initial tests.  More to
share below.

Note that no new capability is needed; IOW, it's fully compatible with the
existing preempt mode, so the naming is not really important other than to
identify the difference between binaries.  That's because this series only
reworks the sender side and does not change the migration protocol; it
just runs faster.  IOW, an old "preempt" QEMU can migrate to a
"preempt-full" QEMU, and vice versa:

- When an old "preempt" QEMU migrates to a "preempt-full" QEMU, it behaves
  the same as running two old "preempt" QEMUs.

- When a "preempt-full" QEMU migrates to an old "preempt" QEMU, it behaves
  the same as running two "preempt-full" QEMUs.

The logic of the series is quite simple too: simply move the existing
preempt channel page sends into the rp-return thread.  It can slow down
the rp-return thread when receiving page requests, but I don't see a major
issue with that so far.

This latency number is getting close to the extreme of the 4K page request
latency of any TCP roundtrip on the 10Gbps nic I have.  The "extreme
number" is something I get from the mig_mon tool, which has a mode [1] to
emulate the extreme tcp roundtrips of page requests.
Performance
===========

Page request latencies have the distributions below, with a VM of 20G mem,
20 cores, a 10Gbps nic, and 18G of fully random writes:

Postcopy Vanilla
----------------

Average: 12093 (us)

@delay_us:
[1]                1 |                                                    |
[2, 4)             0 |                                                    |
[4, 8)             0 |                                                    |
[8, 16)            0 |                                                    |
[16, 32)           1 |                                                    |
[32, 64)           8 |                                                    |
[64, 128)         11 |                                                    |
[128, 256)        14 |                                                    |
[256, 512)        19 |                                                    |
[512, 1K)         14 |                                                    |
[1K, 2K)          35 |                                                    |
[2K, 4K)          18 |                                                    |
[4K, 8K)          87 |@                                                   |
[8K, 16K)       2397 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K)         7 |                                                    |
[32K, 64K)         2 |                                                    |
[64K, 128K)       20 |                                                    |
[128K, 256K)       6 |                                                    |

Postcopy Preempt
----------------

Average: 496 (us)

@delay_us:
[32, 64)           2 |                                                    |
[64, 128)       2306 |@@@@                                                |
[128, 256)     25422 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256, 512)      8238 |@@@@@@@@@@@@@@@@                                    |
[512, 1K)       1066 |@@                                                  |
[1K, 2K)        2167 |@@@@                                                |
[2K, 4K)        3329 |@@@@@@                                              |
[4K, 8K)         109 |                                                    |
[8K, 16K)         48 |                                                    |

Postcopy Preempt-Full
---------------------

Average: 229 (us)

@delay_us:
[8, 16)            1 |                                                    |
[16, 32)           3 |                                                    |
[32, 64)           2 |                                                    |
[64, 128)      11956 |@@@@@@@@@@                                          |
[128, 256)     60403 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256, 512)     15047 |@@@@@@@@@@@@                                        |
[512, 1K)        846 |                                                    |
[1K, 2K)          25 |                                                    |
[2K, 4K)          41 |                                                    |
[4K, 8K)         131 |                                                    |
[8K, 16K)         72 |                                                    |
[16K, 32K)         2 |                                                    |
[32K, 64K)         8 |                                                    |
[64K, 128K)        6 |                                                    |

For fully sequential page access workloads, I mentioned in the previous
preempt-mode work that such workloads may not benefit much from preempt
mode.  But surprisingly, at least in my sequential write test, the
preempt-full mode also benefited sequential access patterns when I
measured it:

Postcopy Vanilla
----------------

Average: 1487 (us)

@delay_us:
[0]               93 |@                                                   |
[1]             1920 |@@@@@@@@@@@@@@@@@@@@@@@                             |
[2, 4)           504 |@@@@@@                                              |
[4, 8)          2234 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
[8, 16)         4199 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32)        3782 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     |
[32, 64)        1016 |@@@@@@@@@@@@                                        |
[64, 128)         81 |@                                                   |
[128, 256)        14 |                                                    |
[256, 512)        26 |                                                    |
[512, 1K)         69 |                                                    |
[1K, 2K)         208 |@@                                                  |
[2K, 4K)         429 |@@@@@                                               |
[4K, 8K)        2779 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |
[8K, 16K)        792 |@@@@@@@@@                                           |
[16K, 32K)         9 |                                                    |

Postcopy Preempt
----------------

Average: 1582 (us)

@delay_us:
[0]               45 |                                                    |
[1]             1786 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       |
[2, 4)           423 |@@@@@@@                                             |
[4, 8)          1903 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                    |
[8, 16)         2933 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   |
[16, 32)        3132 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32, 64)         518 |@@@@@@@@                                            |
[64, 128)         30 |                                                    |
[128, 256)       218 |@@@                                                 |
[256, 512)       214 |@@@                                                 |
[512, 1K)        211 |@@@                                                 |
[1K, 2K)         131 |@@                                                  |
[2K, 4K)         336 |@@@@@                                               |
[4K, 8K)        3023 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[8K, 16K)        479 |@@@@@@@                                             |

Postcopy Preempt-Full
---------------------

Average: 439 (us)

@delay_us:
[0]                3 |                                                    |
[1]             1058 |@                                                   |
[2, 4)           179 |                                                    |
[4, 8)          1079 |@                                                   |
[8, 16)         2251 |@@@                                                 |
[16, 32)        2345 |@@@@                                                |
[32, 64)         713 |@                                                   |
[64, 128)       5386 |@@@@@@@@@                                           |
[128, 256)     30252 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256, 512)     10789 |@@@@@@@@@@@@@@@@@@                                  |
[512, 1K)        367 |                                                    |
[1K, 2K)          26 |                                                    |
[2K, 4K)         256 |                                                    |
[4K, 8K)        1840 |@@@                                                 |
[8K, 16K)        300 |                                                    |

I don't think sequential access matters much in migrations, because for
any non-small VM that has a migration challenge, multiple sequential
accesses will grow into a random access pattern anyway.  But I'm laying
out the data for reference regardless.

Comments welcome, thanks.

TODO List
=========

- Make migration accountings atomic
- Drop rs->f?
- Disable xbzrle for preempt mode?  Or is it perhaps already disabled for
  postcopy?
- If this series is accepted, we can logically drop some of the old
  (complicated) code from the old preempt series.
- Drop the x-postcopy-preempt-break-huge parameter?
- More to come

[1] https://github.com/xzpeter/mig_mon#vm-live-migration-network-emulator

Peter Xu (13):
  migration: Use non-atomic ops for clear log bitmap
  migration: Add postcopy_preempt_active()
  migration: Yield bitmap_mutex properly when sending/sleeping
  migration: Cleanup xbzrle zero page cache update logic
  migration: Disallow postcopy preempt to be used with compress
  migration: Trivial cleanup save_page_header() on same block check
  migration: Remove RAMState.f references in compression code
  migration: Teach PSS about host page
  migration: Introduce pss_channel
  migration: Add pss_init()
  migration: Make PageSearchStatus part of RAMState
  migration: Move last_sent_block into PageSearchStatus
  migration: Send requested page directly in rp-return thread

 include/exec/ram_addr.h |  11 +-
 include/qemu/bitmap.h   |   1 +
 migration/migration.c   |  11 +
 migration/ram.c         | 496 +++++++++++++++++++++++++++++-----------
 util/bitmap.c           |  45 ++++
 5 files changed, 421 insertions(+), 143 deletions(-)

-- 
2.32.0