This series is based on another series I posted here:

Based-on: <20250527215850.1271072-1-pet...@redhat.com>
https://lore.kernel.org/r/20250527215850.1271072-1-pet...@redhat.com

Overview
========

This series almost rewrites the blocktime feature.  It is a postcopy
feature that tracks how long a vCPU was blocked, and how long the whole
system was blocked.  My wild guess is that most people are either not aware
of it or have never tried to use it.

Recently, while running some postcopy tests with my usual scripts for
tracking faults, I remembered once again that we have the blocktime
feature.  I decided to bite the bullet this time and make it more useful.

The feature hasn't been extremely helpful so far.  One major reason might
be the existence of KVM async page fault (which is also enabled by default
in many environments).  KVM async page fault allows guest OSes to schedule
guest threads out when they access a missing page, which means the vCPUs
will _not_ be blocked even when they touch a missing page.  However,
blocktime results that are all close to zero do not necessarily mean the
guest is unaffected: the workload is definitely impacted.  That is also why
I normally measure page fault latencies instead for most postcopy tests;
they matter more to me.

That said, the blocktime layer is actually close to what we want here,
namely tracking fault latencies too, so I added that in this series.  I
also added tracking of non-vCPU threads, so that the latency results stay
valid even when KVM async PF is on.

While at it, I found quite a few things missing in this feature and
tackled each of them in a separate patch.  The section below gives a quick
overview of what has changed.

Major Changes
=============

- Locking refactor: remove atomic ops, rely on page request mutex instead

  It used to rely on atomic ops, which were probably not correct either.
  There is a paragraph in patch "migration/postcopy: Optimize blocktime
  fault tracking with hashtable" explaining why the old code was buggy.

- Extend all blocktime records internally from 32 bits to 64 bits

  This is mostly to support microsecond tracking (it used to be in
  milliseconds).  Note that this does not change the existing results
  reported via QMP, because those are ABI; however, microseconds will be
  used for any new results reported later.

- Added support for reporting average fault latencies (global, per-vCPU
  and non-vCPU); results are in microseconds.

- Initialize blocktime reliably, and only at POSTCOPY_LISTEN

  Rather than hack-initializing it whenever a userfaultfd is created.

- Provide tid->vcpu cache

  Add a small cache for the tid->vcpu hash mapping.  It used to be a for
  loop searching for the CPU index, which is unwanted.

- Replace fault record arrays with a hashtable

  This is an optimization for fast insertion/lookup of fault records.

  Again, it used to be yet another array keeping all the vCPU data.  That
  is not only less performant when there are many vCPUs (especially on the
  lookups, which were another for loop..), but also buggy, because in
  reality each vCPU can sometimes receive more than one fault.

  Please see the patch "migration/postcopy: Optimize blocktime fault
  tracking with hashtable" for more information; a simplified sketch of the
  new data structures also follows this list.

- Added support for tracking non-vCPU faults

  This will be extremely useful when e.g. KVM async page fault is enabled,
  because then the vCPUs almost never block.
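
To make the data structure changes above a bit more concrete, here is a
minimal, self-contained sketch of the idea.  It is not the actual QEMU
code: all the names (BlocktimeSketch, fault_begin, fault_end, ...) are made
up, and it is heavily simplified (for example, the real code needs to cope
with more than one outstanding fault per address/vCPU).  It only shows the
shape of the new design: a single mutex standing in for the page request
mutex instead of atomics, a tid->vCPU cache, a fault hashtable keyed by the
faulted address holding 64-bit microsecond timestamps, and a fallback
counter for faults raised by non-vCPU threads.  It builds against plain
glib:

  #include <glib.h>
  #include <stdio.h>

  typedef struct {
      GMutex lock;               /* stands in for the page request mutex   */
      GHashTable *tid_to_vcpu;   /* thread id -> vCPU index + 1 (0 = miss) */
      GHashTable *faults;        /* faulted address -> start time (us)     */
      guint64 total_latency_us;  /* sum of resolved fault latencies        */
      guint64 resolved_faults;   /* number of resolved faults              */
      guint64 non_vcpu_faults;   /* faults raised by non-vCPU threads      */
  } BlocktimeSketch;

  /* Record a fault: no atomics, everything happens under the single lock. */
  static void fault_begin(BlocktimeSketch *bt, guint64 addr, guint tid)
  {
      guint64 *key = g_new(guint64, 1);
      gint64 *start = g_new(gint64, 1);

      *key = addr;
      *start = g_get_monotonic_time();    /* 64-bit microsecond timestamp */

      g_mutex_lock(&bt->lock);
      if (!g_hash_table_lookup(bt->tid_to_vcpu, GUINT_TO_POINTER(tid))) {
          bt->non_vcpu_faults++;          /* tid is not a known vCPU */
      }
      /* Simplified: the real thing keeps more than one record per key */
      g_hash_table_insert(bt->faults, key, start);
      g_mutex_unlock(&bt->lock);
  }

  /* Resolve a fault and fold its latency into the running totals. */
  static void fault_end(BlocktimeSketch *bt, guint64 addr)
  {
      gint64 *start;

      g_mutex_lock(&bt->lock);
      start = g_hash_table_lookup(bt->faults, &addr);
      if (start) {
          bt->total_latency_us += g_get_monotonic_time() - *start;
          bt->resolved_faults++;
          g_hash_table_remove(bt->faults, &addr);
      }
      g_mutex_unlock(&bt->lock);
  }

  int main(void)
  {
      BlocktimeSketch bt = { .total_latency_us = 0 };

      g_mutex_init(&bt.lock);
      bt.tid_to_vcpu = g_hash_table_new(g_direct_hash, g_direct_equal);
      bt.faults = g_hash_table_new_full(g_int64_hash, g_int64_equal,
                                        g_free, g_free);

      /* Pretend thread 1234 is vCPU 0 (cached as index + 1), faulting once */
      g_hash_table_insert(bt.tid_to_vcpu, GUINT_TO_POINTER(1234u),
                          GUINT_TO_POINTER(1u));
      fault_begin(&bt, 0x7f0000001000ULL, 1234);
      g_usleep(1000);                     /* pretend resolving took ~1ms */
      fault_end(&bt, 0x7f0000001000ULL);

      printf("average fault latency: %" G_GUINT64_FORMAT " us\n",
             bt.total_latency_us / MAX(bt.resolved_faults, 1));
      return 0;
  }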

Test Results
============

I did quite some testing of the feature after the rewrite.  It looks pretty
good so far, so I plan to throw my own scripts away until they prove more
useful again.

I was testing on an 80-vCPU VM with 16GB of memory, the best I could find
at hand.  The overhead on page fault latencies is almost negligible
(numbers below are in microseconds):

  Disabled:
  Average: 236.00 (+-4.66%)

  Enabled:
  Average: 232.67 (+-2.01%)

These are the average results out of three runs each.  Enabling the feature
even makes the latency smaller?  Well, that's probably noise..

Surprisingly, I also tried the old code, and its overhead is likewise
almost not measurable compared to the faults themselves.  I guess a few
"for loop"s over arrays of 80 elements aren't much of a hurdle when the
round trip is still ~200us.  Still, such a rewrite is not only about
performance: for example, the arrays cannot track non-vCPU faults at all.
The hashtable is still pretty much needed, one way or another.

I think this means we should be able to enable this feature together with
postcopy-ram if we want, and unless the VM is extremely huge we shouldn't
expect much overhead.  Actually, I'd bet it works fine even with hundreds
of vCPUs, especially after this rewrite.  Bringing the preempt mode into
the picture as well, I think one should enable all three features for
postcopy by default at some point.

TODO: we may also add buckets to provide better statistics for the latency
report, e.g. how many faults were resolved within each (2^N, 2^(N+1)) us
window, etc.
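
For instance (purely illustrative, not part of this series), a tiny
hypothetical helper could map a latency into its power-of-two bucket:

  #include <stdint.h>
  #include <stdio.h>

  /*
   * Hypothetical helper, for illustration only: map a fault latency in
   * microseconds to a power-of-two bucket, so that bucket N counts the
   * faults resolved within [2^N, 2^(N+1)) us (a zero latency also lands
   * in bucket 0).
   */
  static unsigned int latency_bucket(uint64_t latency_us)
  {
      unsigned int bucket = 0;

      while (latency_us >>= 1) {
          bucket++;
      }
      return bucket;
  }

  int main(void)
  {
      /* A ~200us fault (the RTT seen above) lands in bucket 7: [128, 256) */
      printf("200us -> bucket %u\n", latency_bucket(200));
      return 0;
  }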

Comments welcome, thanks.

Peter Xu (13):
  migration: Add option to set postcopy-blocktime
  migration/postcopy: Push blocktime start/end into page req mutex
  migration/postcopy: Drop all atomic ops in blocktime feature
  migration/postcopy: Make all blocktime vars 64bits
  migration/postcopy: Drop PostcopyBlocktimeContext.start_time
  migration/postcopy: Bring blocktime layer to us level
  migration/postcopy: Add blocktime fault counts per-vcpu
  migration/postcopy: Report fault latencies in blocktime
  migration/postcopy: Initialize blocktime context only until listen
  migration/postcopy: Cache the tid->vcpu mapping for blocktime
  migration/postcopy: Cleanup the total blocktime accounting
  migration/postcopy: Optimize blocktime fault tracking with hashtable
  migration/postcopy: blocktime allows track / report non-vCPU faults

 qapi/migration.json                   |  20 +
 migration/migration.h                 |   2 +-
 migration/postcopy-ram.h              |   2 +
 migration/migration-hmp-cmds.c        |  75 ++--
 migration/migration.c                 |  24 +-
 migration/options.c                   |   2 +
 migration/postcopy-ram.c              | 518 ++++++++++++++++++++------
 tests/qtest/migration/migration-qmp.c |   4 +
 migration/trace-events                |   8 +-
 9 files changed, 497 insertions(+), 158 deletions(-)

-- 
2.49.0

