On Thu, Jan 23, 2025 at 4:31 PM Eric Auger <eric.au...@redhat.com> wrote: > > Hi Jason, > > > On 1/23/25 2:34 AM, Jason Wang wrote: > > On Wed, Jan 22, 2025 at 3:55 PM Eric Auger <eric.au...@redhat.com> wrote: > >> Hi Jason, > >> > >> > >> On 1/22/25 8:17 AM, Jason Wang wrote: > >>> On Wed, Jan 22, 2025 at 12:25 AM Eric Auger <eric.au...@redhat.com> wrote: > >>>> Hi Jason, > >>>> > >>>> On 1/21/25 4:27 AM, Jason Wang wrote: > >>>>> On Tue, Jan 21, 2025 at 1:33 AM Eric Auger <eric.au...@redhat.com> > >>>>> wrote: > >>>>>> When a guest exposed with a vhost device and protected by an > >>>>>> intel IOMMU gets rebooted, we sometimes observe a spurious warning: > >>>>>> > >>>>>> Fail to lookup the translated address ffffe000 > >>>>>> > >>>>>> We observe that the IOMMU gets disabled through a write to the global > >>>>>> command register (CMAR_GCMD.TE) before the vhost device gets stopped. > >>>>>> When this warning happens it can be observed an inflight IOTLB > >>>>>> miss occurs after the IOMMU disable and before the vhost stop. In > >>>>>> that case a flat translation occurs and the check in > >>>>>> vhost_memory_region_lookup() fails. > >>>>>> > >>>>>> Let's disable the IOTLB callbacks when all IOMMU MRs have been > >>>>>> unregistered. > >>>>>> > >>>>>> Signed-off-by: Eric Auger <eric.au...@redhat.com> > >>>>>> --- > >>>>>> hw/virtio/vhost.c | 4 ++++ > >>>>>> 1 file changed, 4 insertions(+) > >>>>>> > >>>>>> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c > >>>>>> index 6aa72fd434..128c2ab094 100644 > >>>>>> --- a/hw/virtio/vhost.c > >>>>>> +++ b/hw/virtio/vhost.c > >>>>>> @@ -931,6 +931,10 @@ static void vhost_iommu_region_del(MemoryListener > >>>>>> *listener, > >>>>>> break; > >>>>>> } > >>>>>> } > >>>>>> + if (QLIST_EMPTY(&dev->iommu_list) && > >>>>>> + dev->vhost_ops->vhost_set_iotlb_callback) { > >>>>>> + dev->vhost_ops->vhost_set_iotlb_callback(dev, false); > >>>>>> + } > >>>>> So the current code assumes: > >>>>> > >>>>> 1) IOMMU is enabled before vhost starts > >>>>> 2) IOMMU is disabled after vhost stops > >>>>> > >>>>> This patch seems to fix 2) but not 1). Do we need to deal with the > >>>>> IOMMU enabled after vhost starts? > >>>> sorry I initially misunderstood the above comment. Indeed in the reboot > >>>> case assumption 2) happens to be wrong. However what I currently do is: > >>>> stop listening to iotlb miss requests from the kernel because my > >>>> understanding is those requests are just spurious ones, generate > >>>> warnings and we do not care since we are rebooting the system. > >>>> > >>>> However I do not claim this could handle the case where the IOMMU MR > >>>> would be turned off and then turned on. I think in that case we should > >>>> also flush the kernel IOTLB and this is not taken care of in this patch. > >>>> Is it a relevant use case? > >>> Not sure. > >>> > >>>> wrt removing assumption 1) and allow IOMMU enabled after vhost start. Is > >>>> that a valid use case as the virtio driver is using the dma api? > >>> It should not be but we can't assume the behaviour of the guest. It > >>> could be buggy or even malicious. > >> agreed > >>> Btw, we had the following codes while handling te: > >>> > >>> /* Handle Translation Enable/Disable */ > >>> static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en) > >>> { > >>> if (s->dmar_enabled == en) { > >>> return; > >>> } > >>> > >>> trace_vtd_dmar_enable(en); > >>> > >>> ... > >>> > >>> vtd_reset_caches(s); > >>> vtd_address_space_refresh_all(s); > >>> } > >>> > >>> vtd_address_space_refresh_all() will basically disable the iommu > >>> memory region. It looks not sufficient to trigger the region_del > >>> callback, maybe we should delete the region or introduce listener > >>> callback? > >> This is exactly the code path which is entered in my use case. > >> > >> vtd_address_space_refresh_all(s) induces the vhost_iommu_region_del. But > >> given the current implement of this latter the IOTLB callback is not unset > >> and the kernel IOTLB is not refreshed. Also as I pointed out the > >> hdev->mem->regions are not updated? shouldn't they. Can you explain what > >> they correspond to? > > Adding Peter for more ideas. > > > > I think it's better to find a way to trigger the listener here, probably: > > > > 1) add/delete the memory regions instead of enable/disable > sorry I don't understand what you mean. The vhost_iommu_region_del call > stack is provided below [1]. Write to the intel iommu GCMD TE bit > induces a call to vhost_iommu_region_del. This happens before the > vhost_dev_stop whose call stack is provided below [2] and originates > from a bus reset.
Right, I see. > > We may have inflight IOTLB miss requests happening between both. > > If this happens, vhost_device_iotlb_miss() fails because the IOVA is not > translated anymore by the IOMMU and the iotlb.translated_addr returned > by address_space_get_iotlb_entry() is not within the registered > vhost_memory_regions looked up in vhost_memory_region_lookup(), hence > the "Fail to lookup the translated address" message. > > It sounds wrong that vhost keeps on using IOVAs that are not translated > anymore. It looks we have a reset ordering issue and this patch is just > removing the sympton and not the cause. Exactly. > > At the moment I don't really get what is initiating the intel iommu TE > bit write. This is a guest action but is it initiated from a misordered > qemu event? It could also be a guest bug which reset the vIOMMU before the device. Do we have a calltrace in the guest? Adding more vtd maintiners for more thoughts. Thanks > > It could also relate to > [PATCH 0/4] intel_iommu: Reset vIOMMU after all the rest of devices > https://lore.kernel.org/all/?q=s%3Aintel_iommu%3A+Reset+vIOMMU+after+all+the+rest+of+devices > > Thanks > > Eric > > > > > [1] > #0 vhost_iommu_region_del (listener=0x55555708a388, > section=0x7ffff62bf030) at ../hw/virtio/vhost.c:927 > #1 0x0000555555c851b4 in address_space_update_topology_pass > (as=0x55555822e4f0, old_view=0x7fffe8e81850, new_view=0x55555723dfa0, > adding=false) at ../system/memory.c:977 > #2 0x0000555555c857a0 in address_space_set_flatview (as=0x55555822e4f0) > at ../system/memory.c:1079 > #3 0x0000555555c85986 in memory_region_transaction_commit () at > ../system/memory.c:1132 > #4 0x0000555555c89f25 in memory_region_set_enabled (mr=0x555557dc0d30, > enabled=false) at ../system/memory.c:2686 > #5 0x0000555555b5edb1 in vtd_switch_address_space (as=0x555557dc0c60) > at ../hw/i386/intel_iommu.c:1735 > #6 0x0000555555b5ee6f in vtd_switch_address_space_all > (s=0x555557db1500) at ../hw/i386/intel_iommu.c:1779 > #7 0x0000555555b64533 in vtd_address_space_refresh_all > (s=0x555557db1500) at ../hw/i386/intel_iommu.c:4006 > #8 0x0000555555b60770 in vtd_handle_gcmd_te (s=0x555557db1500, > en=false) at ../hw/i386/intel_iommu.c:2419 > #9 0x0000555555b60885 in vtd_handle_gcmd_write (s=0x555557db1500) at > ../hw/i386/intel_iommu.c:2449 > #10 0x0000555555b61db2 in vtd_mem_write (opaque=0x555557db1500, addr=24, > val=100663296, size=4) at ../hw/i386/intel_iommu.c:2991 > #11 0x0000555555c833e0 in memory_region_write_accessor > (mr=0x555557db1840, addr=24, value=0x7ffff62bf3d8, size=4, shift=0, > mask=4294967295, attrs=...) at ../system/memory.c:497 > #12 0x0000555555c83711 in access_with_adjusted_size > (addr=24, value=0x7ffff62bf3d8, size=4, access_size_min=4, > access_size_max=8, access_fn=0x555555c832ea > <memory_region_write_accessor>, mr=0x555557db1840, attrs=...) > at ../system/memory.c:573 > #13 0x0000555555c8698b in memory_region_dispatch_write > (mr=0x555557db1840, addr=24, data=100663296, op=MO_32, attrs=...) at > ../system/memory.c:1521 > #14 0x0000555555c95619 in flatview_write_continue_step (attrs=..., > buf=0x7ffff7fbb028 "", len=4, mr_addr=24, l=0x7ffff62bf4c0, > mr=0x555557db1840) at ../system/physmem.c:2803 > #15 0x0000555555c956eb in flatview_write_continue (fv=0x7fffe852d370, > addr=4275634200, attrs=..., ptr=0x7ffff7fbb028, len=4, mr_addr=24, l=4, > mr=0x555557db1840) at ../system/physmem.c:2833 > #16 0x0000555555c957f9 in flatview_write (fv=0x7fffe852d370, > addr=4275634200, attrs=..., buf=0x7ffff7fbb028, len=4) at > ../system/physmem.c:2864 > #17 0x0000555555c95c39 in address_space_write (as=0x555556ff1720 > <address_space_memory>, addr=4275634200, attrs=..., buf=0x7ffff7fbb028, > len=4) at ../system/physmem.c:2984 > #18 0x0000555555c95cb1 in address_space_rw (as=0x555556ff1720 > <address_space_memory>, addr=4275634200, attrs=..., buf=0x7ffff7fbb028, > len=4, is_write=true) at ../system/physmem.c:2994 > #19 0x0000555555ceeb56 in kvm_cpu_exec (cpu=0x55555727e950) at > ../accel/kvm/kvm-all.c:3188 > #20 0x0000555555cf1ea6 in kvm_vcpu_thread_fn (arg=0x55555727e950) at > ../accel/kvm/kvm-accel-ops.c:50 > #21 0x0000555555f6de20 in qemu_thread_start (args=0x555557288960) at > ../util/qemu-thread-posix.c:541 > #22 0x00007ffff7289e92 in start_thread () at /lib64/libc.so.6 > > [2] > #0 vhost_dev_stop (hdev=0x55555708a2c0, vdev=0x555558234cb0, > vrings=false) at ../hw/virtio/vhost.c:2222 > #1 0x0000555555990266 in vhost_net_stop_one (net=0x55555708a2c0, > dev=0x555558234cb0) at ../hw/net/vhost_net.c:350 > #2 0x00005555559906ea in vhost_net_stop (dev=0x555558234cb0, > ncs=0x55555826f968, data_queue_pairs=1, cvq=0) at ../hw/net/vhost_net.c:462 > #3 0x0000555555c1ad0a in virtio_net_vhost_status (n=0x555558234cb0, > status=0 '\000') at ../hw/net/virtio-net.c:318 > #4 0x0000555555c1afaf in virtio_net_set_status (vdev=0x555558234cb0, > status=0 '\000') at ../hw/net/virtio-net.c:393 > #5 0x0000555555c591bd in virtio_set_status (vdev=0x555558234cb0, val=0 > '\000') at ../hw/virtio/virtio.c:2242 > #6 0x0000555555c595eb in virtio_reset (opaque=0x555558234cb0) at > ../hw/virtio/virtio.c:2325 > #7 0x0000555555a0d1e1 in virtio_bus_reset (bus=0x555558234c30) at > ../hw/virtio/virtio-bus.c:109 > #8 0x0000555555a13d51 in virtio_pci_reset (qdev=0x55555822c830) at > ../hw/virtio/virtio-pci.c:2282 > #9 0x0000555555a14016 in virtio_pci_bus_reset_hold (obj=0x55555822c830, > type=RESET_TYPE_COLD) at ../hw/virtio/virtio-pci.c:2322 > #10 0x0000555555d08831 in resettable_phase_hold (obj=0x55555822c830, > opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/resettable.c:180 > #11 0x0000555555d00fc5 in bus_reset_child_foreach (obj=0x555557ffa3c0, > cb=0x555555d086d5 <resettable_phase_hold>, opaque=0x0, > type=RESET_TYPE_COLD) at ../hw/core/bus.c:97 > #12 0x0000555555d084d8 in resettable_child_foreach (rc=0x555557146f20, > obj=0x555557ffa3c0, cb=0x555555d086d5 <resettable_phase_hold>, > opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/resettable.c:92 > #13 0x0000555555d08792 in resettable_phase_hold (obj=0x555557ffa3c0, > opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/resettable.c:169 > #14 0x0000555555d0543b in device_reset_child_foreach > (obj=0x555557ff98e0, cb=0x555555d086d5 <resettable_phase_hold>, > opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/qdev.c:275 > #15 0x0000555555d084d8 in resettable_child_foreach (rc=0x55555715a090, > obj=0x555557ff98e0, cb=0x555555d086d5 <resettable_phase_hold>, > opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/resettable.c:92 > #16 0x0000555555d08792 in resettable_phase_hold (obj=0x555557ff98e0, > opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/resettable.c:169 > #17 0x0000555555d00fc5 in bus_reset_child_foreach (obj=0x555557445120, > cb=0x555555d086d5 <resettable_phase_hold>, opaque=0x0, > type=RESET_TYPE_COLD) at ../hw/core/bus.c:97 > #18 0x0000555555d084d8 in resettable_child_foreach (rc=0x555557146f20, > obj=0x555557445120, cb=0x555555d086d5 <resettable_phase_hold>, > opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/resettable.c:92 > #19 0x0000555555d08792 in resettable_phase_hold (obj=0x555557445120, > opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/resettable.c:169 > #20 0x0000555555d0543b in device_reset_child_foreach > (obj=0x5555572d0a00, cb=0x555555d086d5 <resettable_phase_hold>, > opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/qdev.c:275 > #21 0x0000555555d084d8 in resettable_child_foreach (rc=0x5555570cf410, > obj=0x5555572d0a00, cb=0x555555d086d5 <resettable_phase_hold>, > opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/resettable.c:92 > #22 0x0000555555d08792 in resettable_phase_hold (obj=0x5555572d0a00, > opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/resettable.c:169 > #23 0x0000555555d00fc5 in bus_reset_child_foreach (obj=0x55555723d270, > cb=0x555555d086d5 <resettable_phase_hold>, opaque=0x0, > type=RESET_TYPE_COLD) at ../hw/core/bus.c:97 > #24 0x0000555555d084d8 in resettable_child_foreach (rc=0x5555571bfde0, > obj=0x55555723d270, cb=0x555555d086d5 <resettable_phase_hold>, > opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/resettable.c:92 > #25 0x0000555555d08792 in resettable_phase_hold (obj=0x55555723d270, > opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/resettable.c:169 > #26 0x0000555555d06d6d in resettable_container_child_foreach > (obj=0x55555724a8a0, cb=0x555555d086d5 <resettable_phase_hold>, > opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/resetcontainer.c:54 > #27 0x0000555555d084d8 in resettable_child_foreach (rc=0x5555572180f0, > obj=0x55555724a8a0, cb=0x555555d086d5 <resettable_phase_hold>, > opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/resettable.c:92 > #28 0x0000555555d08792 in resettable_phase_hold (obj=0x55555724a8a0, > opaque=0x0, type=RESET_TYPE_COLD) at ../hw/core/resettable.c:169 > #29 0x0000555555d0838d in resettable_assert_reset (obj=0x55555724a8a0, > type=RESET_TYPE_COLD) at ../hw/core/resettable.c:58 > #30 0x0000555555d082e5 in resettable_reset (obj=0x55555724a8a0, > type=RESET_TYPE_COLD) at ../hw/core/resettable.c:45 > #31 0x00005555558db5c8 in qemu_devices_reset > (reason=SHUTDOWN_CAUSE_GUEST_RESET) at ../hw/core/reset.c:179 > #32 0x0000555555b6f5b2 in pc_machine_reset (machine=0x555557234490, > reason=SHUTDOWN_CAUSE_GUEST_RESET) at ../hw/i386/pc.c:1877 > #33 0x0000555555a5a0a2 in qemu_system_reset > (reason=SHUTDOWN_CAUSE_GUEST_RESET) at ../system/runstate.c:516 > #34 0x0000555555a5a891 in main_loop_should_exit (status=0x7fffffffd574) > at ../system/runstate.c:792 > #35 0x0000555555a5a992 in qemu_main_loop () at ../system/runstate.c:825 > #36 0x0000555555e9cced in qemu_default_main () at ../system/main.c:37 > #37 0x0000555555e9cd2a in main (argc=58, argv=0x7fffffffd6d8) at > ../system/main.c:48 > > > > > > or > > > > 2) introduce new listener ops that can be triggered when a region is > > enabled or disabled > > > > Thanks > > > >> Thanks > >> > >> Eric > >> > >>> Thanks > >>> > >>>> Eric > >>>> > >>>> > >>>>> Thanks > >>>>> > >>>>>> } > >>>>>> > >>>>>> void vhost_toggle_device_iotlb(VirtIODevice *vdev) > >>>>>> -- > >>>>>> 2.47.1 > >>>>>> >