postcopy: Report fault latencies in blocktime

Dr. David Alan Gilbert Tue, 10 Jun 2025 09:53:06 -0700

* Peter Xu (pet...@redhat.com) wrote:
> On Tue, Jun 10, 2025 at 12:08:23AM +0000, Dr. David Alan Gilbert wrote:
> > > diff --git a/qapi/migration.json b/qapi/migration.json
> > > index 4963f6ca12..e95b7402cb 100644
> > > --- a/qapi/migration.json
> > > +++ b/qapi/migration.json
> > > @@ -236,6 +236,17 @@
> > >  #     This is only present when the postcopy-blocktime migration
> > >  #     capability is enabled.  (Since 3.0)
> > >  #
> > > +# @postcopy-latency: average remote page fault latency (in us).  Note that
> > > +#     this doesn't include all faults, but only the ones that require a
> > > +#     remote page request.  So it should be always bigger than the real
> > > +#     average page fault latency. This is only present when the
> > > +#     postcopy-blocktime migration capability is enabled.  (Since 10.1)
> > > +#
> > > +# @postcopy-vcpu-latency: average remote page fault latency per vCPU (in
> > > +#     us).  It has the same definition of @postcopy-latency, but instead
> > > +#     this is the per-vCPU statistics.  This is only present when the
> > > +#     postcopy-blocktime migration capability is enabled.  (Since 10.1)
> > 
> > I wonder if even 'us' is too big; given you have 64bits to play with, and your
> > examples show some samples landing in under 10us, perhaps it's best
> > to at least define the qapi  fields as ns, even if you keep with the same
> > buckets for now?
> 
> The few <10us ones should pretty much be outliers, I'd expect it happened
> because some faulted pages got lucky to be migrated (in the background
> stream rather than the preempt stream) right after sending the request.
> 
> But it's still a fair point, especially if there's nothing to lose by
> switching to nanoseconds here when we have 64-bit fields. I also did a
> quick check online; it looks like RDMA over a 100Gbps NIC may indeed
> complete a round trip within a few microseconds, at least under zero
> load.
> 
> Let me do the switch in v3.
> 
> While at it, when thinking of possible future unit/format changes in the
> report, maybe I should also mark all of these fields experimental from the
> start? Then we don't necessarily need to maintain the ABI - the expectation
> is that even if a mgmt app would like to fetch those, it should only fetch
> them and dump them into a log so that a human can read them later, for
> debugging purposes only.


Yeh I think that's OK, although perhaps another way would be to add
a field indicating the time of the first bucket; i.e. you could specify
that all the values are in ns, but have first-bucket=1000 to be exactly
the same as you have it now.
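
A rough sketch of that idea, purely for illustration and not taken from
the patch: it assumes bucket 0 covers [0, first-bucket) and that each
later bucket doubles the upper bound. The bucket count, the doubling
layout and the names N_BUCKETS and latency_to_bucket() are all
hypothetical.

```c
#include <stdint.h>

/* Hypothetical number of latency buckets in the report. */
#define N_BUCKETS 8

/*
 * Map a fault latency (in ns) to a bucket index.
 *
 * Assumed layout (not from the patch): bucket 0 covers
 * [0, first_bucket_ns), and every later bucket doubles the upper
 * bound.  The last bucket collects everything above the final bound.
 */
static unsigned latency_to_bucket(uint64_t latency_ns,
                                  uint64_t first_bucket_ns)
{
    unsigned idx = 0;
    uint64_t bound = first_bucket_ns;

    while (latency_ns >= bound && idx < N_BUCKETS - 1) {
        bound *= 2;
        idx++;
    }
    return idx;
}
```

With first_bucket_ns=1000 the bucket edges sit at 1us, 2us, 4us and so
on, while every reported value stays in ns; a smaller first bucket could
be chosen later without changing the unit of the fields.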

Dave

> -- 
> Peter Xu
> 
-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
\        dave @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/

Re: [PATCH v2 08/13] migration/postcopy: Report fault latencies in blocktime