On Mon, Apr 14, 2025 at 11:20 AM Sahil Siddiq <icegambi...@gmail.com> wrote:
>
> Hi,
>
> On 3/26/25 1:05 PM, Eugenio Perez Martin wrote:
> > On Mon, Mar 24, 2025 at 2:59 PM Sahil Siddiq <icegambi...@gmail.com> wrote:
> >> I managed to fix a few issues while testing this patch series.
> >> There is still one issue that I am unable to resolve. I thought
> >> I would send this patch series for review in case I have missed
> >> something.
> >>
> >> The issue is that this patch series does not work every time. I
> >> am able to ping L0 from L2 and vice versa via packed SVQ when it
> >> works.
> >
> > So we're on a very good track then!
> >
> >> When this doesn't work, both VMs throw a "Destination Host
> >> Unreachable" error. This is sometimes (not always) accompanied
> >> by the following kernel error (thrown by L2-kernel):
> >>
> >> virtio_net virtio1: output.0:id 1 is not a head!
> >>
> >
> > How many packets have been sent or received before hitting this? If
> > the answer is "the vq size", maybe there is a bug in the code that
> > handles the wraparound of the packed vq, as the used and avail flags
> > need to be flipped there. You can count the packets in the SVQ code.
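To expand on what I mean by "flipped": in a packed vq the avail/used
bits of each descriptor follow the driver's wrap counter, so when the
next avail index wraps past the ring size the counter toggles and the
expected bit pattern inverts. A stand-alone sketch with made-up names
(not the actual SVQ code) of that bookkeeping:

#include <stdbool.h>
#include <stdint.h>

/* Bit positions, as in the virtio 1.1 packed ring layout */
#define PACKED_DESC_F_AVAIL  7
#define PACKED_DESC_F_USED   15

struct ring_state {
    uint16_t next_avail_idx;   /* next free slot in the descriptor ring */
    uint16_t vring_num;        /* ring size, e.g. 256 */
    bool avail_wrap_counter;   /* toggled on every wraparound */
};

/* Flags the driver side writes to make a descriptor available */
static uint16_t avail_desc_flags(const struct ring_state *s)
{
    /* avail bit == wrap counter, used bit == !wrap counter */
    return (s->avail_wrap_counter << PACKED_DESC_F_AVAIL) |
           (!s->avail_wrap_counter << PACKED_DESC_F_USED);
}

static void advance_avail(struct ring_state *s, uint16_t ndescs)
{
    s->next_avail_idx += ndescs;
    if (s->next_avail_idx >= s->vring_num) {
        s->next_avail_idx -= s->vring_num;
        /* The flip: from here on the expected bit pattern inverts */
        s->avail_wrap_counter = !s->avail_wrap_counter;
    }
}

If the counters in your logs flip at exactly vring_num additions, that
part is fine and we can look elsewhere.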
>
> I did a lot more testing. The issue is quite unpredictable in terms of
> when it appears after booting L2, but so far it has almost always
> appeared at some point. Even when pinging works at first, the issue
> shows up after several seconds of pinging.
>

Maybe you can speed it up with ping -f?

> The total number of svq descriptors varied in every test run. But in
> every case, all 256 indices were filled in the descriptor region for
> vq with vq_idx = 0. This is the RX vq, right?

Right!

> This was filled while L2
> was booting. When the ctrl vq is disabled, I am not sure what is
> responsible for filling the data plane vqs during boot.
>

The nested guest's driver fills the rx queue at startup. It then kicks,
SVQ receives the descriptors, copies them to the shadow virtqueue, and
kicks L0 QEMU.
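In rough pseudo-C, with made-up names (the real flow lives in
vhost_handle_guest_kick() and friends in
hw/virtio/vhost-shadow-virtqueue.c), the path I'm describing is:

struct elem;                                  /* a guest buffer chain  */
struct svq;                                   /* shadow virtqueue state */

extern struct elem *guest_avail_pop(struct svq *s);         /* what L2 exposed */
extern void shadow_ring_add(struct svq *s, struct elem *e); /* copy to shadow ring */
extern void kick_l0(struct svq *s);                         /* notify the L0 device */

static void handle_guest_kick(struct svq *s)
{
    struct elem *e;

    /* Drain everything the L2 driver just made available (the rx refill) */
    while ((e = guest_avail_pop(s)) != NULL) {
        shadow_ring_add(s, e);
    }

    /* ... and let the L0 device know there is work to do */
    kick_l0(s);
}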

> =====
> The issue is hit most frequently when the following command is run
> in L0:
> $ ip addr add 111.1.1.1/24 dev tap0
> $ ip link set tap0 up
>
> or, running the following in L2:
> # ip addr add 111.1.1.2/24 dev eth0
>

I guess those are the commands that bring the network up, aren't they?

> The other vq (vq_idx=1) is not filled completely before the issue is
> hit.
> I have been noting down the numbers and here is an example:
>
> 295 descriptors were added individually to the queues i.e., there were no 
> chains (vhost_svq_add_packed)
> |_ 256 additions in vq_idx = 0, all with unique ids
>      |---- 27 descriptors (ids 0 through 26) were received later from the 
> device (vhost_svq_get_buf_packed)
> |_ 39 additions in vq_idx = 1
>      |_ 13 descriptors had id = 0
>      |_ 26 descriptors had id = 1
>      |---- All descriptors were received at some point from the device 
> (vhost_svq_get_buf_packed)
>
> There was one case in which vq_idx=0 had wrapped around. I verified
> that flags were set appropriately during the wrap (avail and used flags
> were flipped as expected).
>

OK, sounds like you're able to hit it before filling the queue. I'd go
for debugging notifications for this one, then. More on this below.

> =====
> The next common situation where this issue is hit is during startup.
> Before L2 can finish booting successfully, this error is thrown:
>
> virtio_net virtio1: output.0:id 0 is not a head!
>
> 258 descriptors were added individually to the queues during startup (there 
> were no chains) (vhost_svq_add_packed)
> |_ 256 additions in vq_idx = 0, all with unique ids
>     |---- None of them were received by the device (vhost_svq_get_buf_packed)
> |_ 2 additions in vq_idx = 1
>     |_ id = 0 in index 0
>     |_ id = 1 in index 1
>     |---- Both descriptors were received at some point during startup from 
> the device (vhost_svq_get_buf_packed)
>
> =====
> Another case is after several seconds of pinging L0 from L2.
>
> [   99.034114] virtio_net virtio1: output.0:id 0 is not a head!
>

So the L2 guest sees a descriptor id it did not make available
previously. This can happen if SVQ returns the same descriptor twice,
or if it doesn't fill in the id or flags properly. It could also be
caused by not protecting the write ordering in the ring, but I don't
see anything obviously wrong from looking at the code.
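Just in case it helps to have it written down, the ordering I'd
double-check when a descriptor is made available is roughly this
(stand-alone sketch with illustrative names; QEMU has its own barrier
helpers such as smp_wmb()):

#include <stdint.h>

struct vring_packed_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t id;
    uint16_t flags;
};

static void publish_desc(struct vring_packed_desc *desc,
                         uint64_t addr, uint32_t len,
                         uint16_t id, uint16_t avail_flags)
{
    desc->addr = addr;
    desc->len  = len;
    desc->id   = id;

    /*
     * Release barrier: addr/len/id must be visible before the flags,
     * because the flags write is what makes the descriptor available.
     */
    __atomic_thread_fence(__ATOMIC_RELEASE);

    desc->flags = avail_flags;
}

If the id were written after the flags, or without a barrier in
between, the other side could legitimately see an id it doesn't know
about, which is exactly the symptom.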

> 366 descriptors were added individually to the queues i.e., there were no 
> chains (vhost_svq_add_packed)
> |_ 289 additions in vq_idx = 0, wrap-around was observed with avail and used 
> flags inverted for 33 descriptors
> |   |---- 40 descriptors (ids 0 through 39) were received from the device 
> (vhost_svq_get_buf_packed)
> |_ 77 additions in vq_idx = 1
>      |_ 76 descriptors had id = 0
>      |_ 1 descriptor had id = 1
>      |---- all 77 descriptors were received at some point from the device 
> (vhost_svq_get_buf_packed)
>
> I am not entirely sure now whether there's an issue in the packed vq
> implementation in QEMU or whether this is caused by some sort of race
> condition in Linux.
>
> "id is not a head" is being thrown because vq->packed.desc_state[id].data
> doesn't exist for the corresponding id in Linux [1]. But QEMU seems to have
> stored some data for this id via vhost_svq_add() [2]. Linux sets the value
> of vq->packed.desc_state[id].data in its version of virtqueue_add_packed() 
> [3].
>

Let's keep debugging. Can you trace the ids that the L2 kernel makes
available, and then the ones that it uses? At the same time, can you
trace the ids that SVQ sees in vhost_svq_get_buf and the ones it
flushes? This lets us check the set of available descriptors at any
given time.
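Something as simple as these one-liners should be enough; the variable
names are just placeholders, adjust them to whatever is in scope where
you add the traces:

/* L2 kernel, drivers/virtio/virtio_ring.c */
pr_info("avail: vq %s id %u\n", vq->vq.name, id);  /* where virtqueue_add_packed() publishes a descriptor */
pr_info("used:  vq %s id %u\n", vq->vq.name, id);  /* where virtqueue_get_buf_ctx_packed() reads one back */

/* SVQ, hw/virtio/vhost-shadow-virtqueue.c */
fprintf(stderr, "svq avail: vq %u id %u\n", vq_idx, id);  /* when adding to the shadow ring */
fprintf(stderr, "svq used:  vq %u id %u\n", vq_idx, id);  /* in vhost_svq_get_buf_packed() / the flush path */

Diffing the two sides should tell us whether a duplicate or unknown id
is produced by SVQ or only observed by the guest.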

> >> This error is not always thrown, but when it is, the id varies.
> >> This is invariably followed by a soft lockup:
> >> [...]
> >> [  284.662292] Call Trace:
> >> [  284.662292]  <IRQ>
> >> [  284.662292]  ? watchdog_timer_fn+0x1e6/0x270
> >> [  284.662292]  ? __pfx_watchdog_timer_fn+0x10/0x10
> >> [  284.662292]  ? __hrtimer_run_queues+0x10f/0x2b0
> >> [  284.662292]  ? hrtimer_interrupt+0xf8/0x230
> >> [  284.662292]  ? __sysvec_apic_timer_interrupt+0x4d/0x140
> >> [  284.662292]  ? sysvec_apic_timer_interrupt+0x39/0x90
> >> [  284.662292]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
> >> [  284.662292]  ? virtqueue_enable_cb_delayed+0x115/0x150
> >> [  284.662292]  start_xmit+0x2a6/0x4f0 [virtio_net]
> >> [  284.662292]  ? netif_skb_features+0x98/0x300
> >> [  284.662292]  dev_hard_start_xmit+0x61/0x1d0
> >> [  284.662292]  sch_direct_xmit+0xa4/0x390
> >> [  284.662292]  __dev_queue_xmit+0x84f/0xdc0
> >> [  284.662292]  ? nf_hook_slow+0x42/0xf0
> >> [  284.662292]  ip_finish_output2+0x2b8/0x580
> >> [  284.662292]  igmp_ifc_timer_expire+0x1d5/0x430
> >> [  284.662292]  ? __pfx_igmp_ifc_timer_expire+0x10/0x10
> >> [  284.662292]  call_timer_fn+0x21/0x130
> >> [  284.662292]  ? __pfx_igmp_ifc_timer_expire+0x10/0x10
> >> [  284.662292]  __run_timers+0x21f/0x2b0
> >> [  284.662292]  run_timer_softirq+0x1d/0x40
> >> [  284.662292]  __do_softirq+0xc9/0x2c8
> >> [  284.662292]  __irq_exit_rcu+0xa6/0xc0
> >> [  284.662292]  sysvec_apic_timer_interrupt+0x72/0x90
> >> [  284.662292]  </IRQ>
> >> [  284.662292]  <TASK>
> >> [  284.662292]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
> >> [  284.662292] RIP: 0010:pv_native_safe_halt+0xf/0x20
> >> [  284.662292] Code: 22 d7 c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90 90 
> >> 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 53 75 3f 00 fb f4 
> >> <c3> cc c0
> >> [  284.662292] RSP: 0018:ffffb8f0000b3ed8 EFLAGS: 00000212
> >> [  284.662292] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 
> >> 0000000000000000
> >> [  284.662292] RDX: 4000000000000000 RSI: 0000000000000083 RDI: 
> >> 00000000000289ec
> >> [  284.662292] RBP: ffff96f200810000 R08: 0000000000000000 R09: 
> >> 0000000000000001
> >> [  284.662292] R10: 0000000000000001 R11: 0000000000000000 R12: 
> >> 0000000000000000
> >> [  284.662292] R13: 0000000000000000 R14: ffff96f200810000 R15: 
> >> 0000000000000000
> >> [  284.662292]  default_idle+0x9/0x20
> >> [  284.662292]  default_idle_call+0x2c/0xe0
> >> [  284.662292]  do_idle+0x226/0x270
> >> [  284.662292]  cpu_startup_entry+0x2a/0x30
> >> [  284.662292]  start_secondary+0x11e/0x140
> >> [  284.662292]  secondary_startup_64_no_verify+0x184/0x18b
> >> [  284.662292]  </TASK>
> >>
> >> The soft lockup seems to happen in
> >> drivers/net/virtio_net.c:start_xmit() [1].
> >>
> >
> > Maybe it gets stuck in the do {} while
> > (!virtqueue_enable_cb_delayed()) loop? You can add a printk at the
> > return of virtqueue_enable_cb_delayed() and check whether it matches
> > the rate at which you're sending or receiving pings. For example, if
> > you ping once per second, you should not see a lot of traces.
> >
> > If that does not help, I'd try never disabling notifications, both in
> > the kernel and in SVQ, and check whether that works.
>
> In order to disable notifications, will something have to be commented
> out in the implementation?
>

To *never* disable notifications, you should comment out the SVQ calls
to virtio_queue_set_notification and vhost_svq_disable_notification.
This way, the other sides (the L0 device in QEMU and the guest) are
forced to always notify SVQ, and we can check whether that solves the
first issue.
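Schematically, it is just a matter of leaving both calls commented out
wherever SVQ currently suppresses notifications (not a real patch, just
to show the idea):

/* guest -> SVQ notifications: keep them enabled */
/* virtio_queue_set_notification(svq->vq, false); */

/* device -> SVQ notifications: keep them enabled too */
/* vhost_svq_disable_notification(svq); */

With both commented out, the guest and the L0 device will notify on
every batch. It is slower, but it takes the event suppression logic out
of the picture.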

> >> [...]
> >> QEMU command to boot L1:
> >>
> >> $ sudo ./qemu/build/qemu-system-x86_64 \
> >> -enable-kvm \
> >> -drive 
> >> file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
> >> -net nic,model=virtio \
> >> -net user,hostfwd=tcp::2222-:22 \
> >> -device intel-iommu,snoop-control=on \
> >> -device 
> >> virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,mq=off,ctrl_vq=off,ctrl_rx=off,ctrl_vlan=off,ctrl_mac_addr=off,packed=on,event_idx=off,bus=pcie.0,addr=0x4
> >>  \
> >> -netdev tap,id=net0,script=no,downscript=no,vhost=off \
> >> -nographic \
> >> -m 8G \
> >> -smp 4 \
> >> -M q35 \
> >> -cpu host 2>&1 | tee vm.log
> >>
>
> I have added "-device virtio-net-pci,indirect_desc=off,queue_reset=off"
> to the L0 QEMU command to boot L1.
>
> Thanks,
> Sahil
>
> [1] 
> https://github.com/torvalds/linux/blob/master/drivers/virtio/virtio_ring.c#L1762
> [2] 
> https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost-shadow-virtqueue.c#L290
> [3] 
> https://github.com/torvalds/linux/blob/master/drivers/virtio/virtio_ring.c#L1564
>

