postcopy: Report fault latencies in blocktime

Markus Armbruster Mon, 02 Jun 2025 02:34:33 -0700

Peter Xu <pet...@redhat.com> writes:

> Blocktime so far only cares about the time one vcpu (or the whole system)
> got blocked.  It would also be helpful if it could report the latency
> of page requests, which could be very sensitive during postcopy.
>
> Blocktime itself is sometimes not very important, especially when one
> thinks about KVM async PF support, which means vCPUs are literally almost
> not blocked at all because the guest OS is smart enough to switch to
> another task when a remote fault is needed.
>
> However, latency is still sensitive and important because even if the guest
> vCPU is running on threads that do not need a remote fault, the workload
> that accesses some missing page is still affected.
>
> Add two entries to the report, showing how long it takes to resolve a
> remote fault.  Mention in the QAPI doc that this is not the real average
> fault latency, but only that of the faults that required a remote request.
>
> Unwrap get_vcpu_blocktime_list() so we don't need to walk the list twice,
> meanwhile add the entry checks in qtests for all postcopy tests.
>
> Cc: Markus Armbruster <arm...@redhat.com>
> Cc: Dr. David Alan Gilbert <d...@treblig.org>
> Signed-off-by: Peter Xu <pet...@redhat.com>
> ---
>  qapi/migration.json                   | 13 +++++
>  migration/migration-hmp-cmds.c        | 70 ++++++++++++++++++---------
>  migration/postcopy-ram.c              | 48 ++++++++++++------
>  tests/qtest/migration/migration-qmp.c |  3 ++
>  4 files changed, 97 insertions(+), 37 deletions(-)
>
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 8b9c53595c..8b13cea169 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -236,6 +236,17 @@
>  #     This is only present when the postcopy-blocktime migration
>  #     capability is enabled.  (Since 3.0)
>  #
> +# @postcopy-latency: average remote page fault latency (in us).  Note that
> +#     this doesn't include all faults, but only the ones that require a
> +#     remote page request.  So it should be always bigger than the real
> +#     average page fault latency. This is only present when the
> +#     postcopy-blocktime migration capability is enabled.  (Since 10.1)
> +#
> +# @postcopy-vcpu-latency: average remote page fault latency per vCPU (in
> +#     us).  It has the same definition of @postcopy-latency, but instead
> +#     this is the per-vCPU statistics. This is only present when the

Two spaces between sentences for consistency, please.

> +#     postcopy-blocktime migration capability is enabled.  (Since 10.1)

I figure the @i-th array element is for the vCPU with index @i.  Correct?

This is also only present when @postcopy-blocktime is enabled.  Correct?

Could a QMP client compute @postcopy-latency from
@postcopy-vcpu-latency?
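
To illustrate why this matters, here is a rough sketch.  It assumes the
overall figure is total latency divided by total number of remote faults,
which the quoted hunk does not confirm.  The per-vCPU fault counts are
hypothetical; the patch as quoted does not expose them.  Under those
assumptions, a client holding only the per-vCPU averages cannot recover
the overall average unless every vCPU saw the same number of faults:

```python
def overall_average(per_vcpu_avg, per_vcpu_count):
    """Weighted mean: total latency divided by total number of faults.

    per_vcpu_count is a hypothetical input; it is not part of the
    MigrationInfo fields quoted in this patch.
    """
    total_faults = sum(per_vcpu_count)
    if total_faults == 0:
        return 0
    total_latency = sum(a * n for a, n in zip(per_vcpu_avg, per_vcpu_count))
    return total_latency / total_faults


# Made-up numbers: vCPU 0 had many fast faults, vCPU 1 one slow fault.
per_vcpu_avg = [100, 300]    # average latency per vCPU, in us
per_vcpu_count = [9, 1]      # hypothetical remote fault counts

weighted = overall_average(per_vcpu_avg, per_vcpu_count)
naive = sum(per_vcpu_avg) / len(per_vcpu_avg)

print(weighted, naive)
```

With these made-up inputs the weighted result and the plain mean of the
per-vCPU averages differ, so if the answer to the question is "no", the
doc comment might want to say so.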

> +#
>  # @socket-address: Only used for tcp, to know what the real port is
>  #     (Since 4.0)
>  #
> @@ -275,6 +286,8 @@
>             '*blocked-reasons': ['str'],
>             '*postcopy-blocktime': 'uint32',
>             '*postcopy-vcpu-blocktime': ['uint32'],
> +           '*postcopy-latency': 'uint64',
> +           '*postcopy-vcpu-latency': ['uint64'],
>             '*socket-address': ['SocketAddress'],
>             '*dirty-limit-throttle-time-per-round': 'uint64',
>             '*dirty-limit-ring-full-time': 'uint64'} }
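
Since both new members are optional, a client has to cope with their
absence.  A minimal sketch of how a client might consume them follows.
The sample reply is hand-written for illustration, not real QEMU output,
and mapping array index i to vCPU i is the assumption asked about above:

```python
import json

# Hand-written sample of a query-migrate return value (illustrative only).
sample = json.loads(
    '{"status": "completed",'
    ' "postcopy-latency": 250,'
    ' "postcopy-vcpu-latency": [200, 300]}'
)


def latency_report(info):
    """Return the latency fields, or None when they are absent.

    Both members are optional in the schema, so use .get() rather
    than indexing directly.
    """
    overall = info.get("postcopy-latency")
    per_vcpu = info.get("postcopy-vcpu-latency")
    if overall is None or per_vcpu is None:
        return None
    # Assumption: element i of the array belongs to the vCPU with index i.
    return {"overall_us": overall, "per_vcpu_us": dict(enumerate(per_vcpu))}


print(latency_report(sample))
print(latency_report({"status": "active"}))
```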

[...]

Re: [PATCH 08/13] migration/postcopy: Report fault latencies in blocktime
