Hi Jason,
On 1/23/25 2:34 AM, Jason Wang wrote: > On Wed, Jan 22, 2025 at 3:55 PM Eric Auger <eric.au...@redhat.com> wrote: >> Hi Jason, >> >> >> On 1/22/25 8:17 AM, Jason Wang wrote: >>> On Wed, Jan 22, 2025 at 12:25 AM Eric Auger <eric.au...@redhat.com> wrote: >>>> Hi Jason, >>>> >>>> On 1/21/25 4:27 AM, Jason Wang wrote: >>>>> On Tue, Jan 21, 2025 at 1:33 AM Eric Auger <eric.au...@redhat.com> wrote: >>>>>> When a guest exposed with a vhost device and protected by an >>>>>> intel IOMMU gets rebooted, we sometimes observe a spurious warning: >>>>>> >>>>>> Fail to lookup the translated address ffffe000 >>>>>> >>>>>> We observe that the IOMMU gets disabled through a write to the global >>>>>> command register (CMAR_GCMD.TE) before the vhost device gets stopped. >>>>>> When this warning happens it can be observed an inflight IOTLB >>>>>> miss occurs after the IOMMU disable and before the vhost stop. In >>>>>> that case a flat translation occurs and the check in >>>>>> vhost_memory_region_lookup() fails. >>>>>> >>>>>> Let's disable the IOTLB callbacks when all IOMMU MRs have been >>>>>> unregistered. >>>>>> >>>>>> Signed-off-by: Eric Auger <eric.au...@redhat.com> >>>>>> --- >>>>>> hw/virtio/vhost.c | 4 ++++ >>>>>> 1 file changed, 4 insertions(+) >>>>>> >>>>>> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c >>>>>> index 6aa72fd434..128c2ab094 100644 >>>>>> --- a/hw/virtio/vhost.c >>>>>> +++ b/hw/virtio/vhost.c >>>>>> @@ -931,6 +931,10 @@ static void vhost_iommu_region_del(MemoryListener >>>>>> *listener, >>>>>> break; >>>>>> } >>>>>> } >>>>>> + if (QLIST_EMPTY(&dev->iommu_list) && >>>>>> + dev->vhost_ops->vhost_set_iotlb_callback) { >>>>>> + dev->vhost_ops->vhost_set_iotlb_callback(dev, false); >>>>>> + } >>>>> So the current code assumes: >>>>> >>>>> 1) IOMMU is enabled before vhost starts >>>>> 2) IOMMU is disabled after vhost stops >>>>> >>>>> This patch seems to fix 2) but not 1). Do we need to deal with the >>>>> IOMMU enabled after vhost starts? >>>> sorry I initially misunderstood the above comment. Indeed in the reboot >>>> case assumption 2) happens to be wrong. However what I currently do is: >>>> stop listening to iotlb miss requests from the kernel because my >>>> understanding is those requests are just spurious ones, generate >>>> warnings and we do not care since we are rebooting the system. >>>> >>>> However I do not claim this could handle the case where the IOMMU MR >>>> would be turned off and then turned on. I think in that case we should >>>> also flush the kernel IOTLB and this is not taken care of in this patch. >>>> Is it a relevant use case? >>> Not sure. >>> >>>> wrt removing assumption 1) and allow IOMMU enabled after vhost start. Is >>>> that a valid use case as the virtio driver is using the dma api? >>> It should not be but we can't assume the behaviour of the guest. It >>> could be buggy or even malicious. >> agreed >>> Btw, we had the following codes while handling te: >>> >>> /* Handle Translation Enable/Disable */ >>> static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en) >>> { >>> if (s->dmar_enabled == en) { >>> return; >>> } >>> >>> trace_vtd_dmar_enable(en); >>> >>> ... >>> >>> vtd_reset_caches(s); >>> vtd_address_space_refresh_all(s); >>> } >>> >>> vtd_address_space_refresh_all() will basically disable the iommu >>> memory region. It looks not sufficient to trigger the region_del >>> callback, maybe we should delete the region or introduce listener >>> callback? >> This is exactly the code path which is entered in my use case. >> >> vtd_address_space_refresh_all(s) induces the vhost_iommu_region_del. But >> given the current implement of this latter the IOTLB callback is not unset >> and the kernel IOTLB is not refreshed. Also as I pointed out the >> hdev->mem->regions are not updated? shouldn't they. Can you explain what >> they correspond to? > Adding Peter for more ideas. > > I think it's better to find a way to trigger the listener here, probably: > > 1) add/delete the memory regions instead of enable/disable sorry I don't understand what you mean. The vhost_iommu_region_del call stack is provided below [1]. Write to the intel iommu GCMD TE bit induces a call to vhost_iommu_region_del. This happens before the vhost_dev_stop whose call stack is provided below [2] and originates from a bus reset. We may have inflight IOTLB miss requests happening between both. If this happens, vhost_device_iotlb_miss() fails because the IOVA is not translated anymore by the IOMMU and the iotlb.translated_addr returned by address_space_get_iotlb_entry() is not within the registered vhost_memory_regions looked up in vhost_memory_region_lookup(), hence the "Fail to lookup the translated address" message. It sounds wrong that vhost keeps on using IOVAs that are not translated anymore. It looks we have a reset ordering issue and this patch is just removing the sympton and not the cause. At the moment I don't really get what is initiating the intel iommu TE bit write. This is a guest action but is it initiated from a misordered qemu event? It could also relate to [PATCH 0/4] intel_iommu: Reset vIOMMU after all the rest of devices https://lore.kernel.org/all/?q=s%3Aintel_iommu%3A+Reset+vIOMMU+after+all+the+rest+of+devices Thanks Eric [1] #0 vhost_iommu_region_del (listener=0x55555708a388, section=0x7ffff62bf030) at ../hw/virtio/vhost.c:927 #1 0x0000555555c851b4 in address_space_update_topology_pass (as=0x55555822e4f0, old_view=0x7fffe8e81850, new_view=0x55555723dfa0, adding=false) at ../system/memory.c:977 #2 0x0000555555c857a0 in address_space_set_flatview (as=0x55555822e4f0) at ../system/memory.c:1079 #3 0x0000555555c85986 in memory_region_transaction_commit () at ../system/memory.c:1132 #4 0x0000555555c89f25 in memory_region_set_enabled (mr=0x555557dc0d30, enabled=false) at ../system/memory.c:2686 #5 0x0000555555b5edb1 in vtd_switch_address_space (as=0x555557dc0c60) at ../hw/i386/intel_iommu.c:1735 #6 0x0000555555b5ee6f in vtd_switch_address_space_all (s=0x555557db1500) at ../hw/i386/intel_iommu.c:1779 #7 0x0000555555b64533 in vtd_address_space_refresh_all (s=0x555557db1500) at ../hw/i386/intel_iommu.c:4006 #8 0x0000555555b60770 in vtd_handle_gcmd_te (s=0x555557db1500, en=false) at ../hw/i386/intel_iommu.c:2419 #9 0x0000555555b60885 in vtd_handle_gcmd_write (s=0x555557db1500) at ../hw/i386/intel_iommu.c:2449 #10 0x0000555555b61db2 in vtd_mem_write (opaque=0x555557db1500, addr=24, val=100663296, size=4) at ../hw/i386/intel_iommu.c:2991 #11 0x0000555555c833e0 in memory_region_write_accessor (mr=0x555557db1840, addr=24, value=0x7ffff62bf3d8, size=4, shift=0, mask=4294967295, attrs=...) at ../system/memory.c:497 #12 0x0000555555c83711 in access_with_adjusted_size (addr=24, value=0x7ffff62bf3d8, size=4, access_size_min=4, access_size_max=8, access_fn=0x555555c832ea <memory_region_write_accessor>, mr=0x555557db1840, attrs=...) at ../system/memory.c:573 #13 0x0000555555c8698b in memory_region_dispatch_write (mr=0x555557db1840, addr=24, data=100663296, op=MO_32, attrs=...) at ../system/memory.c:1521 #14 0x0000555555c95619 in flatview_write_continue_step (attrs=..., buf=0x7ffff7fbb028 "", len=4, mr_addr=24, l=0x7ffff62bf4c0, mr=0x555557db1840) at ../system/physmem.c:2803 #15 0x0000555555c956eb in flatview_write_continue (fv=0x7fffe852d370, addr=4275634200, attrs=..., ptr=0x7ffff7fbb028, len=4, mr_addr=24, l=4, mr=0x555557db1840) at ../system/physmem.c:2833 #16 0x0000555555c957f9 in flatview_write (fv=0x7fffe852d370, addr=4275634200, attrs=..., buf=0x7ffff7fbb028, len=4) at ../system/physmem.c:2864 #17 0x0000555555c95c39 in address_space_write (as=0x555556ff1720 <address_space_memory>, addr=4275634200, attrs=..., buf=0x7ffff7fbb028, len=4) at ../system/physmem.c:2984 #18 0x0000555555c95cb1 in address_space_rw (as=0x555556ff1720 <address_space_memory>, addr=4275634200, attrs=..., buf=0x7ffff7fbb028, len=4, is_write=true) at ../system/physmem.c:2994 #19 0x0000555555ceeb56 in kvm_cpu_exec (cpu=0x55555727e950) at ../accel/kvm/kvm-all.c:3188 #20 0x0000555555cf1ea6 in kvm_vcpu_thread_fn (arg=0x55555727e950) at ../accel/kvm/kvm-accel-ops.c:50 #21 0x0000555555f6de20 in qemu_thread_start (args=0x555557288960) at ../util/qemu-thread-posix.c:541 #22 0x00007ffff7289e92 in start_thread () at /lib64/libc.so.6 [2] #0 vhost_dev_stop (hdev=0x55555708a2c0, vdev=0x555558234cb0, vrings=false) at ../hw/virtio/vhost.c:2222 #1 0x0000555555990266 in vhost_net_stop_one (net=0x55555708a2c0, dev=0x555558234cb0) at ../hw/net/vhost_net.c:350 #2 0x00005555559906ea in vhost_net_stop (dev=0x555558234cb0, ncs=0x55555826f968, data_queue_pairs=1, cvq=0) at ../hw/net/vhost_net.c:462 #3 0x0000555555c1ad0a in virtio_net_vhost_status (n=0x555558234cb0, status=0 '\000') at ../hw/net/virtio-net.c:318 #4 0x0000555555c1afaf in virtio_net_set_status (vdev=0x555558234cb0, status=0 '\000') at ../hw/net/virtio-net.c:393 #5 0x0000555555c591bd in virtio_set_status (vdev=0x555558234cb0, val=0 '\000') at ../hw/virtio/virtio.c:2242 #6 0x0000555555c595eb in virtio_reset (opaque=0x555558234cb0) at ../hw/virtio/virtio.c:2325 #7 0x0000555555a0d1e1 in virtio_bus_reset (bus=0x555558234c30) at ../hw/virtio/virtio-bus.c:109 #8 0x0000555555a13d51 in virtio_pci_reset (qdev=0x55555822c830) at ../hw/virtio/virtio-pci.c:2282 #9 0x0000555555a14016 in virtio_pci_bus_reset_hold (obj=0x55555822c830, type=RESET_TYPE_COLD) at ../hw/virtio/virtio-pci.c:2322 #10 0x0000555555d08831 in resettable_phase_hold (obj=0x55555822c830, opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/resettable.c:180 #11 0x0000555555d00fc5 in bus_reset_child_foreach (obj=0x555557ffa3c0, cb=0x555555d086d5 <resettable_phase_hold>, opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/bus.c:97 #12 0x0000555555d084d8 in resettable_child_foreach (rc=0x555557146f20, obj=0x555557ffa3c0, cb=0x555555d086d5 <resettable_phase_hold>, opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/resettable.c:92 #13 0x0000555555d08792 in resettable_phase_hold (obj=0x555557ffa3c0, opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/resettable.c:169 #14 0x0000555555d0543b in device_reset_child_foreach (obj=0x555557ff98e0, cb=0x555555d086d5 <resettable_phase_hold>, opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/qdev.c:275 #15 0x0000555555d084d8 in resettable_child_foreach (rc=0x55555715a090, obj=0x555557ff98e0, cb=0x555555d086d5 <resettable_phase_hold>, opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/resettable.c:92 #16 0x0000555555d08792 in resettable_phase_hold (obj=0x555557ff98e0, opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/resettable.c:169 #17 0x0000555555d00fc5 in bus_reset_child_foreach (obj=0x555557445120, cb=0x555555d086d5 <resettable_phase_hold>, opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/bus.c:97 #18 0x0000555555d084d8 in resettable_child_foreach (rc=0x555557146f20, obj=0x555557445120, cb=0x555555d086d5 <resettable_phase_hold>, opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/resettable.c:92 #19 0x0000555555d08792 in resettable_phase_hold (obj=0x555557445120, opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/resettable.c:169 #20 0x0000555555d0543b in device_reset_child_foreach (obj=0x5555572d0a00, cb=0x555555d086d5 <resettable_phase_hold>, opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/qdev.c:275 #21 0x0000555555d084d8 in resettable_child_foreach (rc=0x5555570cf410, obj=0x5555572d0a00, cb=0x555555d086d5 <resettable_phase_hold>, opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/resettable.c:92 #22 0x0000555555d08792 in resettable_phase_hold (obj=0x5555572d0a00, opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/resettable.c:169 #23 0x0000555555d00fc5 in bus_reset_child_foreach (obj=0x55555723d270, cb=0x555555d086d5 <resettable_phase_hold>, opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/bus.c:97 #24 0x0000555555d084d8 in resettable_child_foreach (rc=0x5555571bfde0, obj=0x55555723d270, cb=0x555555d086d5 <resettable_phase_hold>, opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/resettable.c:92 #25 0x0000555555d08792 in resettable_phase_hold (obj=0x55555723d270, opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/resettable.c:169 #26 0x0000555555d06d6d in resettable_container_child_foreach (obj=0x55555724a8a0, cb=0x555555d086d5 <resettable_phase_hold>, opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/resetcontainer.c:54 #27 0x0000555555d084d8 in resettable_child_foreach (rc=0x5555572180f0, obj=0x55555724a8a0, cb=0x555555d086d5 <resettable_phase_hold>, opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/resettable.c:92 #28 0x0000555555d08792 in resettable_phase_hold (obj=0x55555724a8a0, opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/resettable.c:169 #29 0x0000555555d0838d in resettable_assert_reset (obj=0x55555724a8a0, type=RESET_TYPE_COLD) at ../hw/core/resettable.c:58 #30 0x0000555555d082e5 in resettable_reset (obj=0x55555724a8a0, type=RESET_TYPE_COLD) at ../hw/core/resettable.c:45 #31 0x00005555558db5c8 in qemu_devices_reset (reason=SHUTDOWN_CAUSE_GUEST_RESET) at ../hw/core/reset.c:179 #32 0x0000555555b6f5b2 in pc_machine_reset (machine=0x555557234490, reason=SHUTDOWN_CAUSE_GUEST_RESET) at ../hw/i386/pc.c:1877 #33 0x0000555555a5a0a2 in qemu_system_reset (reason=SHUTDOWN_CAUSE_GUEST_RESET) at ../system/runstate.c:516 #34 0x0000555555a5a891 in main_loop_should_exit (status=0x7fffffffd574) at ../system/runstate.c:792 #35 0x0000555555a5a992 in qemu_main_loop () at ../system/runstate.c:825 #36 0x0000555555e9cced in qemu_default_main () at ../system/main.c:37 #37 0x0000555555e9cd2a in main (argc=58, argv=0x7fffffffd6d8) at ../system/main.c:48 > > or > > 2) introduce new listener ops that can be triggered when a region is > enabled or disabled > > Thanks > >> Thanks >> >> Eric >> >>> Thanks >>> >>>> Eric >>>> >>>> >>>>> Thanks >>>>> >>>>>> } >>>>>> >>>>>> void vhost_toggle_device_iotlb(VirtIODevice *vdev) >>>>>> -- >>>>>> 2.47.1 >>>>>>