On Tue, Aug 19, 2025 at 5:11 PM Jonah Palmer <jonah.pal...@oracle.com> wrote:
>
>
>
> On 8/19/25 3:10 AM, Eugenio Perez Martin wrote:
> > On Mon, Aug 18, 2025 at 4:46 PM Jonah Palmer <jonah.pal...@oracle.com> 
> > wrote:
> >>
> >>
> >>
> >> On 8/18/25 2:51 AM, Eugenio Perez Martin wrote:
> >>> On Fri, Aug 15, 2025 at 4:50 PM Jonah Palmer <jonah.pal...@oracle.com> 
> >>> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 8/14/25 5:28 AM, Eugenio Perez Martin wrote:
> >>>>> On Wed, Aug 13, 2025 at 4:06 PM Peter Xu <pet...@redhat.com> wrote:
> >>>>>>
> >>>>>> On Wed, Aug 13, 2025 at 11:25:00AM +0200, Eugenio Perez Martin wrote:
> >>>>>>> On Mon, Aug 11, 2025 at 11:56 PM Peter Xu <pet...@redhat.com> wrote:
> >>>>>>>>
> >>>>>>>> On Mon, Aug 11, 2025 at 05:26:05PM -0400, Jonah Palmer wrote:
> >>>>>>>>> This effort was started to reduce the guest visible downtime by
> >>>>>>>>> virtio-net/vhost-net/vhost-vDPA during live migration, especially
> >>>>>>>>> vhost-vDPA.
> >>>>>>>>>
> >>>>>>>>> The downtime contributed by vhost-vDPA, for example, does not come
> >>>>>>>>> from having to migrate a lot of state, but rather from expensive
> >>>>>>>>> backend control-plane latency such as CVQ configuration (e.g. MQ
> >>>>>>>>> queue pairs, RSS, MAC/VLAN filters, offload settings, MTU, etc.).
> >>>>>>>>> These require kernel/HW NIC operations, which dominate its downtime.
> >>>>>>>>>
> >>>>>>>>> In other words, by migrating the state of virtio-net early (before
> >>>>>>>>> the stop-and-copy phase), we can also start staging the backend
> >>>>>>>>> configuration, which is the main contributor to downtime when
> >>>>>>>>> migrating a vhost-vDPA device.
> >>>>>>>>>
> >>>>>>>>> I apologize if this series gives the impression that we're migrating
> >>>>>>>>> a lot of data here. It's more along the lines of moving control-plane
> >>>>>>>>> latency out of the stop-and-copy phase.
> >>>>>>>>
> >>>>>>>> I see, thanks.
> >>>>>>>>
> >>>>>>>> Please add these into the cover letter of the next post.  IMHO it's
> >>>>>>>> extremely important information to explain the real goal of this
> >>>>>>>> work.  I bet it is not what most people expect when reading the
> >>>>>>>> current cover letter.
> >>>>>>>>
> >>>>>>>> Then it could have nothing to do with iterative phase, am I right?
> >>>>>>>>
> >>>>>>>> What data does the dest QEMU need to start staging backend
> >>>>>>>> configurations to the HW underneath?  Does dest QEMU already have
> >>>>>>>> them in the cmdline?
> >>>>>>>>
> >>>>>>>> Asking this because I want to know whether it can be done completely
> >>>>>>>> without src QEMU at all, e.g. when dest QEMU starts.
> >>>>>>>>
> >>>>>>>> If src QEMU's data is still needed, please also first consider
> >>>>>>>> providing such a facility using an "early VMSD" if at all possible:
> >>>>>>>> feel free to refer to commit 3b95a71b22827d26178.
> >>>>>>>>
> >>>>>>>
> >>>>>>> While it works for this series, it does not allow resending the state
> >>>>>>> when the src device changes, for example if the number of virtqueues
> >>>>>>> is modified.
> >>>>>>
> >>>>>> Some explanation of how syncing the number of vqueues helps downtime
> >>>>>> would help.  Not "it might preheat things", but exactly why, and how
> >>>>>> that differs when it's pure software versus when hardware is involved.
> >>>>>>
> >>>>>
> >>>>> According to Nvidia engineers, configuring the vqs (number, size, RSS,
> >>>>> etc.) takes about ~200ms:
> >>>>> https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566...@nvidia.com/T/
> >>>>>
> >>>>> Adding Dragos here in case he can provide more details. Maybe the
> >>>>> numbers have changed though.
> >>>>>
> >>>>> And I guess the difference with pure SW will always come down to PCI
> >>>>> communication, which I assume is slower than configuring the host SW
> >>>>> device in RAM or even CPU cache. But I admit that proper profiling is
> >>>>> needed before making those claims.
> >>>>>
> >>>>> Jonah, can you print the time it takes to configure the vDPA device
> >>>>> with traces vs the time it takes to enable the dataplane of the
> >>>>> device? So we can get an idea of how much time we save with this.
> >>>>>
> >>>>
> >>>> Let me know if this isn't what you're looking for.
> >>>>
> >>>> I'm assuming by "configuration time" you mean:
> >>>>     - Time from device startup (entry to vhost_vdpa_dev_start()) to right
> >>>>       before we start enabling the vrings (e.g.
> >>>>       VHOST_VDPA_SET_VRING_ENABLE in vhost_vdpa_net_cvq_load()).
> >>>>
> >>>> And by "time taken to enable the dataplane" I'm assuming you mean:
> >>>>     - Time right before we start enabling the vrings (see above) to right
> >>>>       after we enable the last vring (at the end of
> >>>>       vhost_vdpa_net_cvq_load())
> >>>>
> >>>> Guest specs: 128G Mem, SVQ=on, CVQ=on, 8 queue pairs:
> >>>>
> >>>> -netdev type=vhost-vdpa,vhostdev=$VHOST_VDPA_0,id=vhost-vdpa0,
> >>>>            queues=8,x-svq=on
> >>>>
> >>>> -device virtio-net-pci,netdev=vhost-vdpa0,id=vdpa0,bootindex=-1,
> >>>>            romfile=,page-per-vq=on,mac=$VF1_MAC,ctrl_vq=on,mq=on,
> >>>>            ctrl_vlan=off,vectors=18,host_mtu=9000,
> >>>>            disable-legacy=on,disable-modern=off
> >>>>
> >>>> ---
> >>>>
> >>>> Configuration time:    ~31s
> >>>> Dataplane enable time: ~0.14ms
> >>>>
> >>>
> >>> I was vague, but yes, that's representative enough! It would be more
> >>> accurate if the configuration time ended at the point where QEMU enables
> >>> the first queue of the dataplane, though.
> >>>
> >>> As Si-Wei mentions, is v->shared->listener_registered == true at the
> >>> beginning of vhost_vdpa_dev_start?
> >>>
> >>
> >> Ah, I also realized that the QEMU I was using for measurements was a
> >> version from before the listener_registered member was introduced.
> >>
> >> I retested with the latest QEMU changes and with x-svq=off, i.e. guest
> >> specs: 128G Mem, SVQ=off, CVQ=on, 8 queue pairs. I ran the test 3 times
> >> for the measurements.
> >>
> >> v->shared->listener_registered == false at the beginning of
> >> vhost_vdpa_dev_start().
> >>
> >
> > Let's move the effect of the mem pinning out of the downtime by
> > registering the listener before the migration. Can you check why it is
> > not registered at vhost_vdpa_set_owner?
> >
>
> Sorry, I was profiling improperly. The listener is registered at
> vhost_vdpa_set_owner initially and v->shared->listener_registered is set
> to true, but once we reach the first vhost_vdpa_dev_start call, it shows
> as false and is re-registered later in the function.
>
> Should we always expect listener_registered == true at every
> vhost_vdpa_dev_start call during startup?

Yes, that leaves all the memory pinning time out of the downtime.
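
To make it concrete, the started == true path I would expect looks roughly
like the following. This is a paraphrased sketch from memory, not the
literal hw/virtio/vhost-vdpa.c code, reusing the field names you already
printed:

    /*
     * The listener, and therefore all of the memory pinning, is supposed
     * to be set up once at vhost_vdpa_set_owner() time, so this fallback
     * should not trigger on a normal (re)start.
     */
    if (started && !v->shared->listener_registered) {
        memory_listener_register(&v->shared->listener, dev->vdev->dma_as);
        v->shared->listener_registered = true;
    }

If that branch is taken on the destination during a migration, the pinning
cost lands back in the downtime window, which is exactly what we want to
avoid.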

> This is what I traced during
> startup of a single guest (no migration).

We can trace the destination's QEMU to be more accurate, but it
probably makes no difference.

> Tracepoint is right at the
> start of the vhost_vdpa_dev_start function:
>
> vhost_vdpa_set_owner() - register memory listener
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1

This is surprising. Can you trace how listener_registered goes to 0 again?
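
If it helps, one way to catch the culprit is to funnel every write to the
flag through a small helper and log the caller. This is only a hypothetical
instrumentation sketch for hw/virtio/vhost-vdpa.c, assuming the shared state
type is called VhostVDPAShared as in current QEMU:

    #include "qemu/log.h"

    /* Log every transition of listener_registered together with the caller,
     * so we can see where it flips back to 0 between vhost_vdpa_set_owner()
     * and the first vhost_vdpa_dev_start(). */
    static void vdpa_set_listener_registered(VhostVDPAShared *s, bool val,
                                             const char *caller)
    {
        if (s->listener_registered != val) {
            qemu_log("listener_registered: %d -> %d (%s)\n",
                     s->listener_registered, val, caller);
        }
        s->listener_registered = val;
    }

    /* ...and then replace the direct assignments with, e.g.:
     * vdpa_set_listener_registered(v->shared, false, __func__); */

Alternatively, a gdb watchpoint on the field ("watch -l") gives the same
answer without rebuilding, and running the destination QEMU with
-trace 'vhost_vdpa_*' -msg timestamp=on also timestamps the configuration
steps.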

> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> ...
> * VQs are now being enabled *
>
> I'm also seeing that when the guest is being shut down,
> dev->vhost_ops->vhost_get_vring_base() is failing in
> do_vhost_virtqueue_stop():
>
> ...
> [  114.718429] systemd-shutdown[1]: Syncing filesystems and block devices.
> [  114.719255] systemd-shutdown[1]: Powering off.
> [  114.719916] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> [  114.724826] ACPI: PM: Preparing to enter system sleep state S5
> [  114.725593] reboot: Power down
> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> qemu-system-x86_64: vhost VQ 2 ring restore failed: -1: Operation not
> permitted (1)
> qemu-system-x86_64: vhost VQ 3 ring restore failed: -1: Operation not
> permitted (1)
> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> qemu-system-x86_64: vhost VQ 4 ring restore failed: -1: Operation not
> permitted (1)
> qemu-system-x86_64: vhost VQ 5 ring restore failed: -1: Operation not
> permitted (1)
> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> qemu-system-x86_64: vhost VQ 6 ring restore failed: -1: Operation not
> permitted (1)
> qemu-system-x86_64: vhost VQ 7 ring restore failed: -1: Operation not
> permitted (1)
> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> qemu-system-x86_64: vhost VQ 8 ring restore failed: -1: Operation not
> permitted (1)
> qemu-system-x86_64: vhost VQ 9 ring restore failed: -1: Operation not
> permitted (1)
> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> qemu-system-x86_64: vhost VQ 10 ring restore failed: -1: Operation not
> permitted (1)
> qemu-system-x86_64: vhost VQ 11 ring restore failed: -1: Operation not
> permitted (1)
> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> qemu-system-x86_64: vhost VQ 12 ring restore failed: -1: Operation not
> permitted (1)
> qemu-system-x86_64: vhost VQ 13 ring restore failed: -1: Operation not
> permitted (1)
> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> qemu-system-x86_64: vhost VQ 14 ring restore failed: -1: Operation not
> permitted (1)
> qemu-system-x86_64: vhost VQ 15 ring restore failed: -1: Operation not
> permitted (1)
> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
>
> However, when x-svq=on, I don't see these errors on shutdown.
>

SVQ can mask this error, as it does not need to forward the ring
restore message to the device: it can just start from 0 and translate
the indexes.
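
A conceptual sketch of why (this is just the idea, not the actual
hw/virtio/vhost-shadow-virtqueue.c code): the ring the device sees always
starts at index 0, and SVQ tracks the guest-visible base itself, so the
guest's index can be reconstructed locally instead of asking the device
with VHOST_GET_VRING_BASE at stop time:

    #include <stdint.h>

    /* uint16_t arithmetic wraps exactly like the virtio ring indexes do */
    static uint16_t svq_guest_idx(uint16_t guest_base, uint16_t shadow_idx)
    {
        return (uint16_t)(guest_base + shadow_idx);
    }

Without SVQ, that VHOST_GET_VRING_BASE request goes straight to the vDPA
device, which is the vhost_get_vring_base failure you see in the shutdown
log.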

Let's focus on listener_registered first :).

> >> ---
> >>
> >> Configuration time: Time from first entry into vhost_vdpa_dev_start() to
> >> right after QEMU enables the first VQ.
> >>    - 26.947s, 26.606s, 27.326s
> >>
> >> Enable dataplane: Time from right after first VQ is enabled to right
> >> after the last VQ is enabled.
> >>    - 0.081ms, 0.081ms, 0.079ms
> >>
> >
>

