On Tue, Aug 19, 2025 at 5:11 PM Jonah Palmer <jonah.pal...@oracle.com> wrote:
>
>
>
> On 8/19/25 3:10 AM, Eugenio Perez Martin wrote:
> > On Mon, Aug 18, 2025 at 4:46 PM Jonah Palmer <jonah.pal...@oracle.com>
> > wrote:
> >>
> >>
> >>
> >> On 8/18/25 2:51 AM, Eugenio Perez Martin wrote:
> >>> On Fri, Aug 15, 2025 at 4:50 PM Jonah Palmer <jonah.pal...@oracle.com>
> >>> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 8/14/25 5:28 AM, Eugenio Perez Martin wrote:
> >>>>> On Wed, Aug 13, 2025 at 4:06 PM Peter Xu <pet...@redhat.com> wrote:
> >>>>>>
> >>>>>> On Wed, Aug 13, 2025 at 11:25:00AM +0200, Eugenio Perez Martin wrote:
> >>>>>>> On Mon, Aug 11, 2025 at 11:56 PM Peter Xu <pet...@redhat.com> wrote:
> >>>>>>>>
> >>>>>>>> On Mon, Aug 11, 2025 at 05:26:05PM -0400, Jonah Palmer wrote:
> >>>>>>>>> This effort was started to reduce the guest visible downtime by
> >>>>>>>>> virtio-net/vhost-net/vhost-vDPA during live migration, especially
> >>>>>>>>> vhost-vDPA.
> >>>>>>>>>
> >>>>>>>>> The downtime contributed by vhost-vDPA, for example, is not from
> >>>>>>>>> having to migrate a lot of state but rather from expensive backend
> >>>>>>>>> control-plane latency like CVQ configurations (e.g. MQ queue pairs,
> >>>>>>>>> RSS, MAC/VLAN filters, offload settings, MTU, etc.). Doing this
> >>>>>>>>> requires kernel/HW NIC operations, which dominate its downtime.
> >>>>>>>>>
> >>>>>>>>> In other words, by migrating the state of virtio-net early (before
> >>>>>>>>> the stop-and-copy phase), we can also start staging backend
> >>>>>>>>> configurations, which is the main contributor of downtime when
> >>>>>>>>> migrating a vhost-vDPA device.
> >>>>>>>>>
> >>>>>>>>> I apologize if this series gives the impression that we're
> >>>>>>>>> migrating a lot of data here. It's more along the lines of moving
> >>>>>>>>> control-plane latency out of the stop-and-copy phase.
> >>>>>>>>
> >>>>>>>> I see, thanks.
> >>>>>>>>
> >>>>>>>> Please add these into the cover letter of the next post. IMHO it's
> >>>>>>>> extremely important information to explain the real goal of this
> >>>>>>>> work. I bet it is not expected for most people when reading the
> >>>>>>>> current cover letter.
> >>>>>>>>
> >>>>>>>> Then it could have nothing to do with the iterative phase, am I
> >>>>>>>> right?
> >>>>>>>>
> >>>>>>>> What are the data needed for the dest QEMU to start staging backend
> >>>>>>>> configurations to the HWs underneath? Does dest QEMU already have
> >>>>>>>> them in the cmdlines?
> >>>>>>>>
> >>>>>>>> Asking this because I want to know whether it can be done completely
> >>>>>>>> without src QEMU at all, e.g. when dest QEMU starts.
> >>>>>>>>
> >>>>>>>> If src QEMU's data is still needed, please also first consider
> >>>>>>>> providing such a facility using an "early VMSD" if it is ever
> >>>>>>>> possible: feel free to refer to commit 3b95a71b22827d26178.
> >>>>>>>>
> >>>>>>>
> >>>>>>> While it works for this series, it does not allow resending the state
> >>>>>>> when the src device changes. For example, if the number of virtqueues
> >>>>>>> is modified.
> >>>>>>
> >>>>>> Some explanation on "how syncing the number of vqueues helps downtime"
> >>>>>> would help. Not "it might preheat things", but exactly why, and how
> >>>>>> that differs when it's pure software, and when hardware will be
> >>>>>> involved.
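For context, the "early VMSD" facility referred to above (commit
3b95a71b22827d26178) amounts to a VMStateDescription whose section is sent at
the start of migration rather than during stop-and-copy. A minimal sketch,
assuming the early_setup flag of VMStateDescription and with a made-up struct
and field purely for illustration (this is not code from the series under
discussion):

#include "qemu/osdep.h"
#include "migration/vmstate.h"

/* Hypothetical state the destination would want before stop-and-copy. */
typedef struct ExampleEarlyState {
    uint16_t num_queue_pairs;
} ExampleEarlyState;

static const VMStateDescription vmstate_example_early = {
    .name = "example-early-state",
    .version_id = 1,
    .minimum_version_id = 1,
    .early_setup = true,    /* send this section before RAM iteration */
    .fields = (const VMStateField[]) {
        VMSTATE_UINT16(num_queue_pairs, ExampleEarlyState),
        VMSTATE_END_OF_LIST()
    },
};

The limitation raised above still applies: an early section is sent once, so
it cannot be re-sent if the source device changes afterwards (e.g. the number
of virtqueues is modified).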
> >>>>>>
> >>>>>
> >>>>> According to Nvidia engineers, configuring vqs (number, size, RSS,
> >>>>> etc.) takes about ~200ms:
> >>>>> https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566...@nvidia.com/T/
> >>>>>
> >>>>> Adding Dragos here in case he can provide more details. Maybe the
> >>>>> numbers have changed though.
> >>>>>
> >>>>> And I guess the difference with pure SW will always come down to PCI
> >>>>> communications, which I assume are slower than configuring the host SW
> >>>>> device in RAM or even CPU cache. But I admit that proper profiling is
> >>>>> needed before making those claims.
> >>>>>
> >>>>> Jonah, can you print the time it takes to configure the vDPA device
> >>>>> with traces vs the time it takes to enable the dataplane of the
> >>>>> device? So we can get an idea of how much time we save with this.
> >>>>>
> >>>>
> >>>> Let me know if this isn't what you're looking for.
> >>>>
> >>>> I'm assuming by "configuration time" you mean:
> >>>>  - Time from device startup (entry to vhost_vdpa_dev_start()) to right
> >>>>    before we start enabling the vrings (e.g. VHOST_VDPA_SET_VRING_ENABLE
> >>>>    in vhost_vdpa_net_cvq_load()).
> >>>>
> >>>> And by "time taken to enable the dataplane" I'm assuming you mean:
> >>>>  - Time from right before we start enabling the vrings (see above) to
> >>>>    right after we enable the last vring (at the end of
> >>>>    vhost_vdpa_net_cvq_load()).
> >>>>
> >>>> Guest specs: 128G Mem, SVQ=on, CVQ=on, 8 queue pairs:
> >>>>
> >>>> -netdev type=vhost-vdpa,vhostdev=$VHOST_VDPA_0,id=vhost-vdpa0,
> >>>>  queues=8,x-svq=on
> >>>>
> >>>> -device virtio-net-pci,netdev=vhost-vdpa0,id=vdpa0,bootindex=-1,
> >>>>  romfile=,page-per-vq=on,mac=$VF1_MAC,ctrl_vq=on,mq=on,
> >>>>  ctrl_vlan=off,vectors=18,host_mtu=9000,
> >>>>  disable-legacy=on,disable-modern=off
> >>>>
> >>>> ---
> >>>>
> >>>> Configuration time: ~31s
> >>>> Dataplane enable time: ~0.14ms
> >>>>
> >>>
> >>> I was vague, but yes, that's representative enough! It would be more
> >>> accurate if the configuration time ended when QEMU enables the first
> >>> queue of the dataplane, though.
> >>>
> >>> As Si-Wei mentions, is v->shared->listener_registered == true at the
> >>> beginning of vhost_vdpa_dev_start?
> >>>
> >>
> >> Ah, I also realized that the QEMU I was using for measurements was a
> >> version before the listener_registered member was introduced.
> >>
> >> I retested with the latest changes in QEMU and set x-svq=off, e.g.
> >> guest specs: 128G Mem, SVQ=off, CVQ=on, 8 queue pairs. I ran the test
> >> 3 times for measurements.
> >>
> >> v->shared->listener_registered == false at the beginning of
> >> vhost_vdpa_dev_start().
> >>
> >
> > Let's move the effect of the mem pinning out of the downtime by
> > registering the listener before the migration. Can you check why it is
> > not registered at vhost_vdpa_set_owner?
> >
>
> Sorry, I was profiling improperly. The listener is registered at
> vhost_vdpa_set_owner initially and v->shared->listener_registered is set
> to true, but once we reach the first vhost_vdpa_dev_start call, it shows
> as false and is re-registered later in the function.
>
> Should we always expect listener_registered == true at every
> vhost_vdpa_dev_start call during startup?
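The trace lines quoted below come from a probe at the top of
vhost_vdpa_dev_start(). A minimal sketch of such a probe, assuming QEMU's
trace-events machinery; the event name vhost_vdpa_dev_start_state and its
format are assumptions, not necessarily what was used for these measurements:

/* hw/virtio/trace-events, hypothetical event:
 * vhost_vdpa_dev_start_state(void *v, int registered, int started) "vdpa: %p listener_registered = %d, started = %d"
 */

/* Near the top of vhost_vdpa_dev_start() in hw/virtio/vhost-vdpa.c,
 * where dev and started are already in scope: */
struct vhost_vdpa *v = dev->opaque;

/* Record whether the memory listener is already registered when the
 * device starts: if it is, the guest memory pinning has already been
 * paid for outside of the downtime window. */
trace_vhost_vdpa_dev_start_state(v, v->shared->listener_registered, started);

With a similar probe on the vring-enable path, the "configuration time" above
is the gap from the first vhost_vdpa_dev_start() entry to the first VQ enable,
and the "dataplane enable time" is the gap from the first to the last VQ
enable.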
Yes, that leaves all the memory pinning time out of the downtime.

> This is what I traced during
> startup of a single guest (no migration).

We can trace the destination's QEMU to be more accurate, but it probably
makes no difference.

> Tracepoint is right at the
> start of the vhost_vdpa_dev_start function:
>
> vhost_vdpa_set_owner() - register memory listener
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1

This is surprising. Can you trace how listener_registered goes to 0 again?

> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> vhost_vdpa_dev_start() - v->shared->listener_registered = 0, started = 1
> ...
> * VQs are now being enabled *
>
> I'm also seeing that when the guest is being shut down,
> dev->vhost_ops->vhost_get_vring_base() is failing in
> do_vhost_virtqueue_stop():
>
> ...
> [  114.718429] systemd-shutdown[1]: Syncing filesystems and block devices.
> [  114.719255] systemd-shutdown[1]: Powering off.
> [  114.719916] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> [  114.724826] ACPI: PM: Preparing to enter system sleep state S5
> [  114.725593] reboot: Power down
> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> qemu-system-x86_64: vhost VQ 2 ring restore failed: -1: Operation not
> permitted (1)
> qemu-system-x86_64: vhost VQ 3 ring restore failed: -1: Operation not
> permitted (1)
> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> qemu-system-x86_64: vhost VQ 4 ring restore failed: -1: Operation not
> permitted (1)
> qemu-system-x86_64: vhost VQ 5 ring restore failed: -1: Operation not
> permitted (1)
> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> qemu-system-x86_64: vhost VQ 6 ring restore failed: -1: Operation not
> permitted (1)
> qemu-system-x86_64: vhost VQ 7 ring restore failed: -1: Operation not
> permitted (1)
> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> qemu-system-x86_64: vhost VQ 8 ring restore failed: -1: Operation not
> permitted (1)
> qemu-system-x86_64: vhost VQ 9 ring restore failed: -1: Operation not
> permitted (1)
> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> qemu-system-x86_64: vhost VQ 10 ring restore failed: -1: Operation not
> permitted (1)
> qemu-system-x86_64: vhost VQ 11 ring restore failed: -1: Operation not
> permitted (1)
> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> qemu-system-x86_64: vhost VQ 12 ring restore failed: -1: Operation not
> permitted (1)
> qemu-system-x86_64: vhost VQ 13 ring restore failed: -1: Operation not
> permitted (1)
> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
> qemu-system-x86_64: vhost VQ 14 ring restore failed: -1: Operation not
> permitted (1)
> qemu-system-x86_64: vhost VQ 15 ring restore failed: -1: Operation not
> permitted (1)
> vhost_vdpa_dev_start() - v->shared->listener_registered = 1, started = 0
>
> However when x-svq=on, I don't see these errors on shutdown.
>

SVQ can mask this error as it does not need to forward the ring restore
message to the device. It can just start with 0 and convert indexes.

Let's focus on listener_registered first :).

> >> ---
> >>
> >> Configuration time: Time from first entry into vhost_vdpa_dev_start() to
> >> right after QEMU enables the first VQ.
> >>  - 26.947s, 26.606s, 27.326s
> >>
> >> Enable dataplane: Time from right after the first VQ is enabled to right
> >> after the last VQ is enabled.
> >>  - 0.081ms, 0.081ms, 0.079ms
> >>
> >
>
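To make the listener_registered discussion concrete, the shape of the logic
in question is roughly the following; a simplified paraphrase of
hw/virtio/vhost-vdpa.c, not the exact upstream code. If the memory listener
is still registered from vhost_vdpa_set_owner() when the device starts,
vhost_vdpa_dev_start() can skip re-registering it, keeping the guest memory
pinning out of the downtime window.

static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
{
    struct vhost_vdpa *v = dev->opaque;

    if (started && !v->shared->listener_registered) {
        /* Taking this branch means the expensive pinning/mapping work
         * happens here, i.e. inside the downtime window. */
        memory_listener_register(&v->shared->listener, dev->vdev->dma_as);
        v->shared->listener_registered = true;
    }

    /* ... vring addresses, kick/call fds, features, vring enable ... */
    return 0;
}

The trace above shows listener_registered already back at 0 by the first
vhost_vdpa_dev_start() call even though vhost_vdpa_set_owner() registered the
listener, which is why the open question is where it gets unregistered in
between.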