This series is based on another series I posted here:

Based-on: <20250609161855.6603-1-pet...@redhat.com>
https://lore.kernel.org/r/20250609161855.6603-1-pet...@redhat.com

v1: https://lore.kernel.org/r/20250527231248.1279174-1-pet...@redhat.com
v2: https://lore.kernel.org/r/20250609191259.9053-1-pet...@redhat.com

v3 changelog:
- Switch to nanoseconds across the whole patchset [Dave]
  NOTE: many patches needed small touch-ups for the unit conversion and
  the rebase; I kept the tags.
- Mark all new blocktime fields experimental in QMP
  The expected use case is for mgmt to query the results and dump them
  into a log for debugging purposes only (rather than parsing them, as of
  now).  Marking them experimental discourages user apps from parsing
  the fields, while keeping us flexible.
- Added the other patch to add latency buckets into this series
- Fixed a few checkpatch issues
- Added Mario's Tested-by tags to the relevant patches
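For context, here is a rough sketch of how mgmt would fetch the results
through QMP's query-migrate once the postcopy-blocktime capability is
enabled.  Only the pre-existing fields are named; the new experimental
fields added by this series are elided, since their exact names live in
the qapi/migration.json changes:

```
-> { "execute": "query-migrate" }
<- { "return": { "status": "completed",
                 "postcopy-blocktime": ...,
                 "postcopy-vcpu-blocktime": [ ... ],
                 ... } }
```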

Overview
========

This series almost rewrites the blocktime feature.  It is a postcopy
feature which tracks how long a vCPU was blocked, and how long the whole
system was blocked.  I'd guess most people are not aware of it, or have
never tried to use it.

Recently, when I was doing some postcopy tests using my normal scripts for
traps, I remembered once again that we have the blocktime feature.  I
decided to bite the bullet this time to make this feature more useful.

The feature hasn't been extremely helpful so far.  One major reason
might be the existence of KVM async page fault (which is also on by
default in many environments).  KVM async page fault allows guest OSes
to schedule guest threads out when they access a missing page, which
means the vCPUs will _not_ be blocked even when accessing a missing
page.  However, blocktime reporting values close to zero does not mean
the guest is unaffected: the workload is definitely impacted.  That's
also why I normally measure page fault latencies instead for most
postcopy tests; they're more relevant to me.

That said, the blocktime layer is actually close to what we want here
for tracking fault latencies too, so I added that in this series.  I
also added tracking of non-vCPU threads, so that the latency results
stay valid even when KVM async PF is ON.

While at it, I found many things missing in this feature.  I tackled
each of them in a separate patch.  The following section gives a quick
look at what has changed.

Major Changes
=============

- Locking refactor: remove atomic ops, rely on page request mutex instead

  It used to rely on atomic ops, but that was probably not correct
  either.  There is a paragraph in the patch "migration/postcopy:
  Optimize blocktime fault tracking with hashtable" explaining why it
  was buggy.

- Extend all blocktime records internally from 32 bits to 64 bits

  This is mostly to support nanosecond tracking (it used to be in
  milliseconds).  Note this does not change the existing results
  reported in QMP, because those are ABI; however, nanoseconds will be
  used for any new results reported later.

- Added support for reporting average fault latencies (global, per-vCPU,
  non-vCPU); results are in nanoseconds.

- Initialize blocktime reliably, and only at POSTCOPY_LISTEN

  Rather than hackishly initializing it when creating any userfaultfd.

- Provide tid->vcpu cache

  Add a quick cache for the tid->vcpu hash mapping.  It used to be a for
  loop searching for the CPU index, which is unwanted.
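The idea can be sketched in plain C.  This is illustrative only, with
made-up names and sizes (the real implementation lives in
migration/postcopy-ram.c and uses QEMU's hash infrastructure); it just
shows replacing the per-fault O(n) scan with an O(1) cached lookup:

```c
#include <stdint.h>

enum { N_VCPUS = 80, CACHE_SZ = 256 };   /* CACHE_SZ: power of two */

static uint32_t vcpu_tids[N_VCPUS];     /* tid of each vCPU thread */
static uint32_t cache_tid[CACHE_SZ];    /* 0 == empty slot */
static int      cache_idx[CACHE_SZ];

/* Old approach: linear scan over every vCPU tid on every fault. */
static int vcpu_index_slow(uint32_t tid)
{
    for (int i = 0; i < N_VCPUS; i++) {
        if (vcpu_tids[i] == tid) {
            return i;
        }
    }
    return -1;
}

/* New approach: consult the cache first; fall back and fill on miss. */
static int vcpu_index_cached(uint32_t tid)
{
    uint32_t slot = (tid * 2654435761u) & (CACHE_SZ - 1);

    while (cache_tid[slot]) {
        if (cache_tid[slot] == tid) {
            return cache_idx[slot];               /* cache hit */
        }
        slot = (slot + 1) & (CACHE_SZ - 1);       /* linear probing */
    }
    int idx = vcpu_index_slow(tid);               /* one-time slow path */
    if (idx >= 0) {
        cache_tid[slot] = tid;
        cache_idx[slot] = idx;
    }
    return idx;
}

/* Self-check: fake tids, one miss+fill, one hit, one unknown tid. */
static int demo(void)
{
    for (int i = 0; i < N_VCPUS; i++) {
        vcpu_tids[i] = 1000 + i;
    }
    if (vcpu_index_cached(1042) != 42) return 1;  /* miss, then filled */
    if (vcpu_index_cached(1042) != 42) return 2;  /* cache hit */
    if (vcpu_index_cached(9999) != -1) return 3;  /* unknown tid */
    return 0;
}
```

After the first fault from a given thread, every later lookup is a
single hash probe instead of a walk over all vCPUs.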

- Replace fault record arrays with a hashtable

  This is an optimization for fast inject/lookup of fault records.

  Again, it used to be yet another array keeping all vCPU data.  That is
  not only less performant when there are many vCPUs (especially on
  lookups, which were yet another for loop..), but also buggy, because
  in reality each vCPU can sometimes receive more than one fault.

  Please see the patch "migration/postcopy: Optimize blocktime fault
  tracking with hashtable" for more information.
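To illustrate why a one-slot-per-vCPU array is insufficient, here is a
hypothetical plain-C sketch (names made up; the real code is in the
patch above) keying fault records by fault address with separate
chaining, so one vCPU can keep several outstanding records at once:

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct FaultRec {
    uint64_t addr;              /* faulting page address */
    uint64_t start_ns;          /* timestamp when the fault was trapped */
    int vcpu;                   /* vCPU index, or -1 for non-vCPU threads */
    struct FaultRec *next;
} FaultRec;

enum { NBUCKETS = 1024 };       /* power of two */
static FaultRec *buckets[NBUCKETS];

static unsigned hash_addr(uint64_t addr)
{
    return (unsigned)((addr >> 12) * 2654435761u) & (NBUCKETS - 1);
}

/* O(1) insert when a fault is trapped; a vCPU may appear twice. */
static void fault_insert(uint64_t addr, int vcpu, uint64_t now_ns)
{
    FaultRec *r = malloc(sizeof(*r));
    r->addr = addr;
    r->start_ns = now_ns;
    r->vcpu = vcpu;
    r->next = buckets[hash_addr(addr)];
    buckets[hash_addr(addr)] = r;
}

/* When the page arrives: unlink one matching record and return its
 * latency, or -1 if nothing was waiting on this address. */
static int64_t fault_resolve(uint64_t addr, uint64_t now_ns)
{
    FaultRec **p = &buckets[hash_addr(addr)];
    for (; *p; p = &(*p)->next) {
        if ((*p)->addr == addr) {
            FaultRec *r = *p;
            int64_t lat = (int64_t)(now_ns - r->start_ns);
            *p = r->next;
            free(r);
            return lat;
        }
    }
    return -1;
}

/* Self-check: two faults on the same vCPU are tracked independently,
 * which a one-record-per-vCPU array could not do. */
static int demo(void)
{
    fault_insert(0x1000, 3, 100);
    fault_insert(0x2000, 3, 150);   /* same vCPU, still outstanding */
    if (fault_resolve(0x2000, 250) != 100) return 1;
    if (fault_resolve(0x1000, 400) != 300) return 2;
    if (fault_resolve(0x3000, 500) != -1) return 3;
    return 0;
}
```

Keying by address rather than by vCPU index is also what makes non-vCPU
fault tracking possible at all, since those threads have no slot in a
per-vCPU array.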

- Added support for non-vCPU fault tracking

  This will be extremely useful when e.g. KVM async page fault is
  enabled, because then the vCPUs almost never block.

- Added latency distribution in power-of-two buckets

  It's the last patch, collected from a separate post:
  https://lore.kernel.org/all/20250609223607.34387-1-pet...@redhat.com/
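The bucketing itself is simple: bucket i counts latencies in
[2^i, 2^(i+1)) nanoseconds.  A minimal sketch with made-up names (see
the last patch for the real implementation):

```c
#include <stdint.h>

enum { N_BUCKETS = 64 };
static uint64_t bucket_count[N_BUCKETS];

/* floor(log2(ns)); 0ns and 1ns both land in bucket 0. */
static unsigned latency_bucket(uint64_t ns)
{
    unsigned i = 0;
    while (ns >>= 1) {
        i++;
    }
    return i;
}

/* Called once per resolved fault with its measured latency. */
static void account_latency(uint64_t ns)
{
    bucket_count[latency_bucket(ns)]++;
}
```

Power-of-two buckets keep the whole distribution in a small fixed array
while still separating e.g. sub-microsecond faults from millisecond
stalls.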

Test Results
============

I did quite a few tests with the feature after the rewrite.  It looks
pretty good so far, and I plan to throw my own scripts away unless they
prove more useful.

I was testing on an 80-vCPU VM with 16GB memory, the best I could find
at hand.  The page fault latency overhead is almost negligible (numbers
in microseconds):

  Disabled:
  Average: 236.00 (+-4.66%)

  Enabled:
  Average: 232.67 (+-2.01%)

These are average results over three runs each.  Enabling the feature
even makes the latency smaller?  Well, that's probably noise..

Surprisingly, I also tried the old code, and its overhead is almost not
measurable either compared to the faults.  I guess a few "for loop"s
over arrays of 80 elements aren't much of a hurdle when the round trip
is still ~200us.  Still, such a rewrite is not only about perf but about
other things too.  For example, the arrays could never trap non-vCPU
faults.  The hashtable is still pretty much needed, one way or another.

I think this means we should be able to enable this feature together
with postcopy-ram if we want, and unless the VM is extremely huge we
shouldn't expect much overhead.  Actually, I'd bet it works fine even
with hundreds of vCPUs, especially after this rewrite.  Bringing preempt
mode into the picture, I think one should enable all three features for
postcopy by default at some point.

Comments welcomed, thanks.

Peter Xu (14):
  migration: Add option to set postcopy-blocktime
  migration/postcopy: Push blocktime start/end into page req mutex
  migration/postcopy: Drop all atomic ops in blocktime feature
  migration/postcopy: Make all blocktime vars 64bits
  migration/postcopy: Drop PostcopyBlocktimeContext.start_time
  migration/postcopy: Bring blocktime layer to ns level
  migration/postcopy: Add blocktime fault counts per-vcpu
  migration/postcopy: Report fault latencies in blocktime
  migration/postcopy: Initialize blocktime context only until listen
  migration/postcopy: Cache the tid->vcpu mapping for blocktime
  migration/postcopy: Cleanup the total blocktime accounting
  migration/postcopy: Optimize blocktime fault tracking with hashtable
  migration/postcopy: blocktime allows track / report non-vCPU faults
  migration/postcopy: Add latency distribution report for blocktime

 qapi/migration.json                   |  38 ++
 migration/migration.h                 |   2 +-
 migration/postcopy-ram.h              |   2 +
 migration/migration-hmp-cmds.c        | 104 ++++-
 migration/migration.c                 |  25 +-
 migration/options.c                   |   2 +
 migration/postcopy-ram.c              | 563 ++++++++++++++++++++------
 tests/qtest/migration/migration-qmp.c |   5 +
 migration/trace-events                |   8 +-
 9 files changed, 593 insertions(+), 156 deletions(-)

-- 
2.49.0
