On Thu, Mar 20, 2025 at 11:48 PM Dragos Tatulea <dtatu...@nvidia.com> wrote:
>
> Hi Lei,
>
> On 03/20, Lei Yang wrote:
> > Hi Dragos, Si-Wei
> >
> > 1. I applied [0] [1] [2] to the downstream kernel and then tested
> > hotplug/unplug; this bug still exists.
> >
> > [0] 35025963326e ("vdpa/mlx5: Fix suboptimal range on iotlb iteration")
> > [1] 29ce8b8a4fa7 ("vdpa/mlx5: Fix PA offset with unaligned starting iotlb map")
> > [2] a6097e0a54a5 ("vdpa/mlx5: Fix oversized null mkey longer than 32bit")
> >
> > 2. Si-Wei mentioned that two patches [1] [2] have been merged into the
> > qemu master branch, but based on the test results they do not help fix
> > this bug.
> > [1] db0d4017f9b9 ("net: parameterize the removing client from nc list")
> > [2] e7891c575fb2 ("net: move backend cleanup to NIC cleanup")
> >
> > 3. I found that the step which triggers the unhealthy report from the
> > firmware is simply booting the guest when using qemu with the current
> > patches. The host dmesg prints the unhealthy info immediately after the
> > guest boots.
> >
Hi Dragos

> Did you set the locked memory to unlimited before (ulimit -l unlimited)?
> This could also be the cause for the FW issue.

Yes, I did. I executed it (ulimit -l unlimited) before booting up the guest.

Thanks
Lei

>
> Thanks,
> Dragos
>
> > Thanks
> > Lei
> >
> >
> > On Wed, Mar 19, 2025 at 8:14 AM Si-Wei Liu <si-wei....@oracle.com> wrote:
> > >
> > > Hi Lei,
> > >
> > > On 3/18/2025 7:06 AM, Lei Yang wrote:
> > > > On Tue, Mar 18, 2025 at 10:15 AM Jason Wang <jasow...@redhat.com> wrote:
> > > >> On Tue, Mar 18, 2025 at 9:55 AM Lei Yang <leiy...@redhat.com> wrote:
> > > >>> Hi Jonah
> > > >>>
> > > >>> I tested this series with the vhost_vdpa device based on a Mellanox
> > > >>> ConnectX-6 DX nic and hit a host kernel crash. This problem is easier
> > > >>> to reproduce under the hotplug/unplug device scenario.
> > > >>> For the core dump messages please review the attachment.
> > > >>> FW version:
> > > >>> # flint -d 0000:0d:00.0 q |grep Version
> > > >>> FW Version:        22.44.1036
> > > >>> Product Version:   22.44.1036
> > > >> The trace looks more like a mlx5e driver bug rather than vDPA?
> > > >>
> > > >> [ 3256.256707] Call Trace:
> > > >> [ 3256.256708]  <IRQ>
> > > >> [ 3256.256709]  ? show_trace_log_lvl+0x1c4/0x2df
> > > >> [ 3256.256714]  ? show_trace_log_lvl+0x1c4/0x2df
> > > >> [ 3256.256715]  ? __build_skb+0x4a/0x60
> > > >> [ 3256.256719]  ? __die_body.cold+0x8/0xd
> > > >> [ 3256.256720]  ? die_addr+0x39/0x60
> > > >> [ 3256.256725]  ? exc_general_protection+0x1ec/0x420
> > > >> [ 3256.256729]  ? asm_exc_general_protection+0x22/0x30
> > > >> [ 3256.256736]  ? __build_skb_around+0x8c/0xf0
> > > >> [ 3256.256738]  __build_skb+0x4a/0x60
> > > >> [ 3256.256740]  build_skb+0x11/0xa0
> > > >> [ 3256.256743]  mlx5e_skb_from_cqe_mpwrq_linear+0x156/0x280 [mlx5_core]
> > > >> [ 3256.256872]  mlx5e_handle_rx_cqe_mpwrq_rep+0xcb/0x1e0 [mlx5_core]
> > > >> [ 3256.256964]  mlx5e_rx_cq_process_basic_cqe_comp+0x39f/0x3c0 [mlx5_core]
> > > >> [ 3256.257053]  mlx5e_poll_rx_cq+0x3a/0xc0 [mlx5_core]
> > > >> [ 3256.257139]  mlx5e_napi_poll+0xe2/0x710 [mlx5_core]
> > > >> [ 3256.257226]  __napi_poll+0x29/0x170
> > > >> [ 3256.257229]  net_rx_action+0x29c/0x370
> > > >> [ 3256.257231]  handle_softirqs+0xce/0x270
> > > >> [ 3256.257236]  __irq_exit_rcu+0xa3/0xc0
> > > >> [ 3256.257238]  common_interrupt+0x80/0xa0
> > > >>
> > > > Hi Jason
> > > >
> > > >> Which kernel tree did you use? Can you please try net.git?
> > > > I used the latest 9.6 downstream kernel and upstream qemu (with this
> > > > series of patches applied) to test this scenario.
> > > > Based on my test results this bug is related to this series of
> > > > patches; the conclusion is based on the following results (all test
> > > > results use the above mentioned nic driver):
> > > > Case 1: downstream kernel + downstream qemu-kvm - pass
> > > > Case 2: downstream kernel + upstream qemu (does not include this series of patches) - pass
> > > > Case 3: downstream kernel + upstream qemu (includes this series of patches) - failed, reproduce ratio 100%
> > > Just as Dragos replied earlier, the firmware was already in a bogus
> > > state before the panic, and I suspect it has something to do with
> > > various bugs in the downstream kernel. You have to apply the 3 patches
> > > to the downstream kernel before you kick off the relevant tests again.
> > > Please pay special attention to which specific command or step
> > > triggers the unhealthy report from firmware, and let us know if you
> > > still run into any of them.
> > >
> > > In addition, you seem to be testing the device hot plug and unplug
> > > use cases, for which the latest qemu should have the related fixes
> > > below [1][2]; in case they are missed somehow it might also end up
> > > with a bad firmware state to some extent. Just fyi.
> > >
> > > [1] db0d4017f9b9 ("net: parameterize the removing client from nc list")
> > > [2] e7891c575fb2 ("net: move backend cleanup to NIC cleanup")
> > >
> > > Thanks,
> > > -Siwei
> > >
> > > > Then I also tried to test it with the net.git tree, but it hits a
> > > > host kernel panic when rebooting the host after compiling. For the
> > > > call trace info please review the following messages:
> > > > [ 9.902851] No filesystem could mount root, tried:
> > > > [ 9.902851]
> > > > [ 9.909248] Kernel panic - not syncing: VFS: Unable to mount root fs on "/dev/mapper/rhel_dell--per760--12-root" or unknown-block(0,0)
> > > > [ 9.921335] CPU: 16 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc6+ #3
> > > > [ 9.928398] Hardware name: Dell Inc. PowerEdge R760/0NH8MJ, BIOS 1.3.2 03/28/2023
> > > > [ 9.935876] Call Trace:
> > > > [ 9.938332]  <TASK>
> > > > [ 9.940436]  panic+0x356/0x380
> > > > [ 9.943513]  mount_root_generic+0x2e7/0x300
> > > > [ 9.947717]  prepare_namespace+0x65/0x270
> > > > [ 9.951731]  kernel_init_freeable+0x2e2/0x310
> > > > [ 9.956105]  ? __pfx_kernel_init+0x10/0x10
> > > > [ 9.960221]  kernel_init+0x16/0x1d0
> > > > [ 9.963715]  ret_from_fork+0x2d/0x50
> > > > [ 9.967303]  ? __pfx_kernel_init+0x10/0x10
> > > > [ 9.971404]  ret_from_fork_asm+0x1a/0x30
> > > > [ 9.975348]  </TASK>
> > > > [ 9.977555] Kernel Offset: 0xc00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> > > > [ 10.101881] ---[ end Kernel panic - not syncing: VFS: Unable to mount root fs on "/dev/mapper/rhel_dell--per760--12-root" or unknown-block(0,0) ]---
> > > >
> > > > # git log -1
> > > > commit 4003c9e78778e93188a09d6043a74f7154449d43 (HEAD -> main, origin/main, origin/HEAD)
> > > > Merge: 8f7617f45009 2409fa66e29a
> > > > Author: Linus Torvalds <torva...@linux-foundation.org>
> > > > Date:   Thu Mar 13 07:58:48 2025 -1000
> > > >
> > > >     Merge tag 'net-6.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
> > > >
> > > >
> > > > Thanks
> > > >
> > > > Lei
> > > >> Thanks
> > > >>
> > > >>> Best Regards
> > > >>> Lei
> > > >>>
> > > >>> On Fri, Mar 14, 2025 at 9:04 PM Jonah Palmer <jonah.pal...@oracle.com> wrote:
> > > >>>> Current memory operations like pinning may take a lot of time at the
> > > >>>> destination. Currently they are done after the source of the migration
> > > >>>> is stopped, and before the workload is resumed at the destination.
> > > >>>> This is a period where neither traffic can flow nor the VM workload
> > > >>>> can continue (downtime).
> > > >>>>
> > > >>>> We can do better, as we know the memory layout of the guest RAM at the
> > > >>>> destination from the moment that all devices are initialized. Moving
> > > >>>> that operation earlier allows QEMU to communicate the maps to the
> > > >>>> kernel while the workload is still running on the source, so Linux can
> > > >>>> start mapping them.
> > > >>>>
> > > >>>> As a small drawback, there is a time during initialization where QEMU
> > > >>>> cannot respond to QMP etc. By some testing, this time is about
> > > >>>> 0.2 seconds. It may be further reduced (or increased) depending on the
> > > >>>> vdpa driver and the platform hardware, and it is dominated by the cost
> > > >>>> of memory pinning.
> > > >>>>
> > > >>>> This matches the time that we move out of the so-called downtime window.
> > > >>>> The downtime is measured by checking the trace timestamps from the
> > > >>>> moment the source suspends the device to the moment the destination
> > > >>>> starts the eighth and last virtqueue pair. For a 39G guest, it goes
> > > >>>> from ~2.2526 secs to 2.0949.
> > > >>>>
> > > >>>> Future directions on top of this series may include moving more things
> > > >>>> ahead of the migration time, like setting DRIVER_OK or performing
> > > >>>> actual iterative migration of virtio-net devices.
> > > >>>>
> > > >>>> Comments are welcome.
> > > >>>>
> > > >>>> This series is a different approach from series [1]. As the title does
> > > >>>> not reflect the changes anymore, please refer to the previous one to
> > > >>>> know the series history.
> > > >>>>
> > > >>>> This series is based on [2]; it must be applied after it.
> > > >>>>
> > > >>>> [Jonah Palmer]
> > > >>>> This series was rebased after [3] was pulled in, as [3] was a
> > > >>>> prerequisite fix for this series.
> > > >>>>
> > > >>>> v3:
> > > >>>> ---
> > > >>>> * Rebase
> > > >>>>
> > > >>>> v2:
> > > >>>> ---
> > > >>>> * Move the memory listener registration to the vhost_vdpa_set_owner function.
> > > >>>> * Move the iova_tree allocation to net_vhost_vdpa_init.
> > > >>>>
> > > >>>> v1 at https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02136.html.
> > > >>>>
> > > >>>> [1] https://patchwork.kernel.org/project/qemu-devel/cover/20231215172830.2540987-1-epere...@redhat.com/
> > > >>>> [2] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg05910.html
> > > >>>> [3] https://lore.kernel.org/qemu-devel/20250217144936.3589907-1-jonah.pal...@oracle.com/
> > > >>>>
> > > >>>> Eugenio Pérez (7):
> > > >>>>   vdpa: check for iova tree initialized at net_client_start
> > > >>>>   vdpa: reorder vhost_vdpa_set_backend_cap
> > > >>>>   vdpa: set backend capabilities at vhost_vdpa_init
> > > >>>>   vdpa: add listener_registered
> > > >>>>   vdpa: reorder listener assignment
> > > >>>>   vdpa: move iova_tree allocation to net_vhost_vdpa_init
> > > >>>>   vdpa: move memory listener register to vhost_vdpa_init
> > > >>>>
> > > >>>>  hw/virtio/vhost-vdpa.c         | 98 ++++++++++++++++++++++------------
> > > >>>>  include/hw/virtio/vhost-vdpa.h | 22 +++++++-
> > > >>>>  net/vhost-vdpa.c               | 34 ++----------
> > > >>>>  3 files changed, 88 insertions(+), 66 deletions(-)
> > > >>>>
> > > >>>> --
> > > >>>> 2.43.5
> > > >>>>
> > > >>>>
> > >
> >
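
A quick way to double-check the locked-memory point discussed above: the
following is a minimal, self-contained sketch (not part of this series) that
reads RLIMIT_MEMLOCK with getrlimit(), which is the limit that
"ulimit -l unlimited" adjusts in the launching shell. Since vhost-vdpa has to
pin guest RAM, a finite limit here is one plausible way to end up with map
failures and an unhealthy firmware report.

/* check_memlock.c - report the locked-memory limit of this process */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        perror("getrlimit(RLIMIT_MEMLOCK)");
        return 1;
    }

    if (rl.rlim_cur == RLIM_INFINITY) {
        printf("locked memory: unlimited\n");
    } else {
        /* A finite limit may cause vDPA map/pin failures. */
        printf("locked memory limit: %llu bytes\n",
               (unsigned long long)rl.rlim_cur);
    }
    return 0;
}

Running it in the same shell that launches QEMU (after "ulimit -l unlimited")
should report "unlimited"; otherwise it prints the current byte limit.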
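On the cover letter's point that the moved-out time is dominated by memory
pinning: the rough sketch below is only an illustration, not how QEMU or the
vdpa driver actually pins guest RAM. It times mlock() on a 1 GiB anonymous
mapping to show how pinning cost grows with the amount of memory, which is why
a 39G guest sees a noticeable chunk of time move out of the downtime window.
It assumes RLIMIT_MEMLOCK is unlimited, as in the ulimit discussion earlier in
the thread.

/* pin_time.c - rough illustration of how long pinning memory takes */
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

int main(void)
{
    const size_t size = 1UL << 30; /* 1 GiB; adjust to taste */
    struct timespec t0, t1;

    void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* mlock() faults in and pins every page of the mapping, similar in
     * spirit to what pinning guest RAM for a vDPA DMA map has to do. */
    if (mlock(buf, size) != 0) {
        perror("mlock (is RLIMIT_MEMLOCK unlimited?)");
        return 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("pinning %zu MiB took %.3f s\n", size >> 20,
           (double)(t1.tv_sec - t0.tv_sec) +
           (t1.tv_nsec - t0.tv_nsec) / 1e9);

    munlock(buf, size);
    munmap(buf, size);
    return 0;
}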