This series is based on the other series I posted here:

Based-on: <20250609161855.6603-1-pet...@redhat.com>
https://lore.kernel.org/r/20250609161855.6603-1-pet...@redhat.com
v1: https://lore.kernel.org/r/20250527231248.1279174-1-pet...@redhat.com
v2: https://lore.kernel.org/r/20250609191259.9053-1-pet...@redhat.com

v3 changelog:
- Switch to nanoseconds across the whole patchset [Dave]
  NOTE: many patches needed small touch-ups for the unit conversion and
  the rebase; I kept the tags.
- Mark all new blocktime fields experimental in QMP
  The expected use case is mgmt querying the results and dumping them
  into a log only for debugging purposes (rather than parsing them, as
  of now).  Marking them experimental suggests that user applications
  should not parse them, while keeping us flexible to change them.
- Added the other patch introducing latency buckets into this series
- Fixed spots with checkpatch issues
- Added Tested-by tags from Mario for relevant patches

Overview
========

This series almost rewrites the blocktime feature.  It is a postcopy
feature which can track how long a vCPU was blocked, and how long the
whole system was blocked.  I'm wildly guessing most people are not
aware of it, or have tried to use it.

Recently, when I was doing some postcopy tests using my normal scripts
for trapping faults, I remembered once again that we have the blocktime
feature.  I decided to bite the bullet this time and make this feature
more useful.

The feature hasn't been extremely helpful so far.  One major reason
might be the existence of KVM async page fault (which is also on by
default in many environments).  KVM async page fault allows guest OSes
to schedule other guest threads out when they're accessing a missing
page.  It means the vCPUs will _not_ be blocked even if they're
accessing a missing page.

However, blocktime reporting values close to zero may not really mean
the guest is unaffected: the workload is definitely impacted.  That's
also why I normally measure page fault latencies instead for most
postcopy tests, because they're more critical to me.  That said, the
blocktime layer is actually close to what we want here for trapping
fault latencies too, so I added that in this series.
I also added tracking of non-vCPU threads, so that the latency results
stay valid even when KVM async PF is ON.

While at it, I found many things missing in this feature.  I tackled
every single one of them with separate patches.  One can have a quick
look at what has changed in the following section.

Major Changes
=============

- Locking refactor: remove atomic ops, rely on the page request mutex
  instead

  It used to rely on atomic ops, but those were probably not correct
  either.  I have a paragraph in patch "migration/postcopy: Optimize
  blocktime fault tracking with hashtable" explaining why it was buggy.

- Extend all blocktime records internally from 32 bits to 64 bits

  This is mostly to support nanosecond tracking (results used to be in
  milliseconds).  Note that this does not change existing results
  reported in QMP in the past because those are ABI; however, we'll use
  nanoseconds for any new results to be reported later.

- Added support to report average fault latencies (global, per-vCPU,
  non-vCPU); results are in nanoseconds.

- Initialize blocktime reliably, and only at POSTCOPY_LISTEN

  Rather than hack-initializing it when creating any userfaultfd.

- Provide a tid->vcpu cache

  Add a quick cache for the tid->vcpu hash mapping.  It used to be a
  for loop looking for the CPU index, which is unwanted.

- Replace fault record arrays with a hashtable

  This is an optimization for fast injection/lookup of fault records.
  Again, it used to be yet another array keeping all vCPU data.  That is
  not only less performant when there are many vCPUs (especially on the
  lookups, which were another for loop), but also buggy, because in
  reality each vCPU can sometimes receive more than one fault.  Please
  see the patch "migration/postcopy: Optimize blocktime fault tracking
  with hashtable" for more information.

- Added support for tracking non-vCPU faults

  This will be extremely useful when e.g. KVM async page fault is
  enabled, because then vCPUs almost never block.
- Added latency distribution in power-of-two buckets

  This is the last patch; it was collected from a separate post:
  https://lore.kernel.org/all/20250609223607.34387-1-pet...@redhat.com/

Test Results
============

I did quite some tests with the feature after the rewrite.  It looks
pretty good so far, and I plan to throw my own scripts away unless this
proves less useful.  I was testing on an 80-vCPU VM with 16GB memory,
the best I could find at hand.  The page latency overhead is almost
negligible:

  Disabled:  Average: 236.00 (+-4.66%)
  Enabled:   Average: 232.67 (+-2.01%)

These are average results out of three runs each.  Enabling the feature
even makes the latency smaller?  Well, that's probably noise.

Surprisingly, I also tried the old code, and its overhead is likewise
almost not measurable compared to the faults themselves.  I guess a few
"for loop"s over arrays of 80 elements aren't much of a hurdle when the
round trip is still ~200us.  But still, such a rewrite is not only
about performance: for example, arrays cannot trap non-vCPU faults, so
the hashtable is still pretty much needed, one way or another.

I think it means we should be able to enable this feature together with
postcopy-ram if we want, and unless the VM is extremely huge we
shouldn't expect much overhead.  Actually, I'd bet it should work just
fine even with hundreds of vCPUs, especially after this rewrite.  If we
add preempt mode into the picture, I think one should enable all three
features for postcopy by default at some point.

Comments welcomed, thanks.
Peter Xu (14):
  migration: Add option to set postcopy-blocktime
  migration/postcopy: Push blocktime start/end into page req mutex
  migration/postcopy: Drop all atomic ops in blocktime feature
  migration/postcopy: Make all blocktime vars 64bits
  migration/postcopy: Drop PostcopyBlocktimeContext.start_time
  migration/postcopy: Bring blocktime layer to ns level
  migration/postcopy: Add blocktime fault counts per-vcpu
  migration/postcopy: Report fault latencies in blocktime
  migration/postcopy: Initialize blocktime context only until listen
  migration/postcopy: Cache the tid->vcpu mapping for blocktime
  migration/postcopy: Cleanup the total blocktime accounting
  migration/postcopy: Optimize blocktime fault tracking with hashtable
  migration/postcopy: blocktime allows track / report non-vCPU faults
  migration/postcopy: Add latency distribution report for blocktime

 qapi/migration.json                   |  38 ++
 migration/migration.h                 |   2 +-
 migration/postcopy-ram.h              |   2 +
 migration/migration-hmp-cmds.c        | 104 ++++-
 migration/migration.c                 |  25 +-
 migration/options.c                   |   2 +
 migration/postcopy-ram.c              | 563 ++++++++++++++++++++------
 tests/qtest/migration/migration-qmp.c |   5 +
 migration/trace-events                |   8 +-
 9 files changed, 593 insertions(+), 156 deletions(-)

-- 
2.49.0