postcopy: Report fault latencies in blocktime

Peter Xu Tue, 10 Jun 2025 09:55:39 -0700

On Tue, Jun 10, 2025 at 12:08:23AM +0000, Dr. David Alan Gilbert wrote:
> > diff --git a/qapi/migration.json b/qapi/migration.json
> > index 4963f6ca12..e95b7402cb 100644
> > --- a/qapi/migration.json
> > +++ b/qapi/migration.json
> > @@ -236,6 +236,17 @@
> >  #     This is only present when the postcopy-blocktime migration
> >  #     capability is enabled.  (Since 3.0)
> >  #
> > +# @postcopy-latency: average remote page fault latency (in us).  Note that
> > +#     this doesn't include all faults, but only the ones that require a
> > +#     remote page request.  So it should be always bigger than the real
> > +#     average page fault latency. This is only present when the
> > +#     postcopy-blocktime migration capability is enabled.  (Since 10.1)
> > +#
> > +# @postcopy-vcpu-latency: average remote page fault latency per vCPU (in
> > +#     us).  It has the same definition of @postcopy-latency, but instead
> > +#     this is the per-vCPU statistics.  This is only present when the
> > +#     postcopy-blocktime migration capability is enabled.  (Since 10.1)
> 
> I wonder if even 'us' is too big; given you have 64bits to play with, and your
> examples show some samples landing in under 10us, perhaps it's best
> to at least define the qapi  fields as ns, even if you keep with the same
> buckets for now?


The few <10us samples are most likely outliers.  I'd expect they happened
because some faulted pages got lucky and were migrated (in the background
stream rather than the preempt stream) right after the request was sent.

But it's still a fair point, especially since there's nothing to lose by
switching to nanoseconds here when we have 64-bit fields.  I also did a quick
check online: it looks like RDMA over a 100Gbps NIC may indeed complete a
round trip within a few microseconds, at least with zero load.

Let me do the switch in v3.
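
To illustrate the trade-off, here is a rough sketch (not the actual patch
code; LatencyStats and the helper names are hypothetical) of what is lost
when a sub-microsecond average is reported in us, and of how much headroom
a 64-bit nanosecond counter leaves:

```c
#include <assert.h>
#include <stdint.h>

#define NS_PER_US  1000ULL
#define NS_PER_SEC 1000000000ULL

/* Hypothetical accumulator, not QEMU's real blocktime context. */
typedef struct {
    uint64_t total_ns;  /* sum of observed remote fault latencies, in ns */
    uint64_t count;     /* number of remote faults observed */
} LatencyStats;

static void latency_record(LatencyStats *s, uint64_t latency_ns)
{
    /* Overflow would take centuries of accumulated latency; treat as a bug. */
    assert(s->total_ns <= UINT64_MAX - latency_ns);
    s->total_ns += latency_ns;
    s->count++;
}

/* Average latency in nanoseconds; 0 when nothing was recorded. */
static uint64_t latency_avg_ns(const LatencyStats *s)
{
    return s->count ? s->total_ns / s->count : 0;
}

/* Same average truncated to microseconds: sub-us values collapse to 0. */
static uint64_t latency_avg_us(const LatencyStats *s)
{
    return latency_avg_ns(s) / NS_PER_US;
}

/* Whole 365-day years representable by a uint64_t nanosecond counter. */
static uint64_t max_years_in_ns(void)
{
    return UINT64_MAX / NS_PER_SEC / (365ULL * 24 * 3600);
}
```

With two samples of 500ns and 900ns, the ns average is 700 while the us
average truncates to 0, and the ns counter still covers several centuries
before wrapping.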

While at it, thinking about possible future unit/format changes in the
report, maybe I should also mark all of these fields experimental from the
start?  Then we don't necessarily need to maintain the ABI.  The expectation
is that even if a mgmt app wants to fetch them, it should only fetch and dump
them into a log so that a human can read them later, purely for debugging.

-- 
Peter Xu

Re: [PATCH v2 08/13] migration/postcopy: Report fault latencies in blocktime
