On Wed, May 22, 2024 at 05:28:50AM -0400, Michael S. Tsirkin wrote:
> On Wed, May 22, 2024 at 03:40:08PM +0800, Aaron Lu wrote:
> > When Intel vIOMMU is used and irq remapping is enabled, using
> > bypass_iommu causes the following two callstacks to be dumped during
> > kernel boot, and all PCI devices attached to the root bridge lose their
> > MSI capabilities and fall back to using IOAPIC:
> >
> > [    0.960262] ------------[ cut here ]------------
> > [    0.961245] WARNING: CPU: 3 PID: 1 at drivers/pci/msi/msi.h:121 pci_msi_setup_msi_irqs+0x27/0x40
> > [    0.963070] Modules linked in:
> > [    0.963695] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 6.9.0-rc7-00056-g45db3ab70092 #1
> > [    0.965225] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> > [    0.967382] RIP: 0010:pci_msi_setup_msi_irqs+0x27/0x40
> > [    0.968378] Code: 90 90 90 0f 1f 44 00 00 48 8b 87 30 03 00 00 89 f2 48 85 c0 74 14 f6 40 28 01 74 0e 48 81 c7 c0 00 00 00 31 f6 e9 29 42 9e ff <0f> 0b b8 ed ff ff ff c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00
> > [    0.971756] RSP: 0000:ffffc90000017988 EFLAGS: 00010246
> > [    0.972669] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> > [    0.973901] RDX: 0000000000000005 RSI: 0000000000000005 RDI: ffff888100ee1000
> > [    0.975391] RBP: 0000000000000005 R08: ffff888101f44d90 R09: 0000000000000228
> > [    0.976629] R10: 0000000000000001 R11: 0000000000008d3f R12: ffffc90000017b80
> > [    0.977864] R13: ffff888102312000 R14: ffff888100ee1000 R15: 0000000000000005
> > [    0.979092] FS:  0000000000000000(0000) GS:ffff88817bd80000(0000) knlGS:0000000000000000
> > [    0.980473] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [    0.981464] CR2: 0000000000000000 CR3: 000000000302e001 CR4: 0000000000770ef0
> > [    0.982687] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [    0.983919] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [    0.985143] PKRU: 55555554
> > [    0.985625] Call Trace:
> > [    0.986056]  <TASK>
> > [    0.986440]  ? __warn+0x80/0x130
> > [    0.987014]  ? pci_msi_setup_msi_irqs+0x27/0x40
> > [    0.987810]  ? report_bug+0x18d/0x1c0
> > [    0.988443]  ? handle_bug+0x3a/0x70
> > [    0.989026]  ? exc_invalid_op+0x13/0x60
> > [    0.989672]  ? asm_exc_invalid_op+0x16/0x20
> > [    0.990374]  ? pci_msi_setup_msi_irqs+0x27/0x40
> > [    0.991118]  __pci_enable_msix_range+0x325/0x5b0
> > [    0.991883]  pci_alloc_irq_vectors_affinity+0xa9/0x110
> > [    0.992698]  vp_find_vqs_msix+0x1a8/0x4c0
> > [    0.993332]  vp_find_vqs+0x3a/0x1a0
> > [    0.993893]  vp_modern_find_vqs+0x17/0x70
> > [    0.994531]  init_vq+0x3ad/0x410
> > [    0.995051]  ? __pfx_default_calc_sets+0x10/0x10
> > [    0.995789]  virtblk_probe+0xeb/0xbc0
> > [    0.996362]  ? up_write+0x74/0x160
> > [    0.996900]  ? down_write+0x4d/0x80
> > [    0.997450]  virtio_dev_probe+0x1bc/0x270
> > [    0.998059]  really_probe+0xc1/0x390
> > [    0.998626]  ? __pfx___driver_attach+0x10/0x10
> > [    0.999288]  __driver_probe_device+0x78/0x150
> > [    0.999924]  driver_probe_device+0x1f/0x90
> > [    1.000506]  __driver_attach+0xce/0x1c0
> > [    1.001073]  bus_for_each_dev+0x70/0xc0
> > [    1.001638]  bus_add_driver+0x112/0x210
> > [    1.002191]  driver_register+0x55/0x100
> > [    1.002760]  virtio_blk_init+0x4c/0x90
> > [    1.003332]  ? __pfx_virtio_blk_init+0x10/0x10
> > [    1.003974]  do_one_initcall+0x41/0x240
> > [    1.004510]  ? kernel_init_freeable+0x240/0x4a0
> > [    1.005142]  kernel_init_freeable+0x321/0x4a0
> > [    1.005749]  ? __pfx_kernel_init+0x10/0x10
> > [    1.006311]  kernel_init+0x16/0x1c0
> > [    1.006798]  ret_from_fork+0x2d/0x50
> > [    1.007303]  ? __pfx_kernel_init+0x10/0x10
> > [    1.007883]  ret_from_fork_asm+0x1a/0x30
> > [    1.008431]  </TASK>
> > [    1.008748] ---[ end trace 0000000000000000 ]---
> >
> > Another callstack happens at pci_msi_teardown_msi_irqs().
> >
> > Actually, every PCI device triggers these two paths; there are only
> > two callstack dumps because both places use WARN_ON_ONCE().
> >
> > What happened is: when irq remapping is enabled, the kernel expects
> > every PCI device (or its parent bridge) to appear in some DMA Remapping
> > Hardware unit Definition (DRHD)'s device scope list. If a device does
> > not, its irq domain becomes NULL, which makes enabling MSI for that
> > device fail.
> >
> > Per my understanding, only a virtualized system can have such a setup:
> > irq remapping enabled while not all PCI/PCIe devices appear in a DRHD's
> > device scope.
> >
> > Enhance the document by mentioning what can happen when bypass_iommu
> > is used.
> >
> > For the detailed qemu cmdline and guest kernel dmesg, please see:
> > https://lore.kernel.org/qemu-devel/20240510072519.GA39314@ziqianlu-desk2/
> >
> > Reported-by: Juro Bystricky <juro.bystri...@intel.com>
> > Signed-off-by: Aaron Lu <aaron...@intel.com>
>
> Is this issue specific to Linux?
Ah, to be honest, I have never tried any other guest OS. I just did a
quick check using FreeBSD 13.2 and it appears FreeBSD doesn't enable MSI
for PCI devices even without vIOMMU:

root@bsdvm:~ # lspci
... ...
00:03.0 SCSI storage controller: Red Hat, Inc. Virtio block device
        Subsystem: Red Hat, Inc. Device 0002
pcilib: 0000:00:03.0 64-bit device address ignored.
        Flags: bus master, fast devsel, latency 0, IRQ 23  (<- note here)
        I/O ports at c000
        Memory at fc053000 (32-bit, non-prefetchable)
        Memory at <unassigned> (64-bit, prefetchable)
        Memory at <unassigned> (32-bit, non-prefetchable)
        Capabilities: [98] MSI-X: Enable- Count=9 Masked-  (<- and here)

And from dmesg, I saw:

root@bsdvm:~ # dmesg | grep apic
ioapic0 <Version 2.0> irqs 0-23

So it appears MSI functionality is indeed not enabled even without
vIOMMU, and adding vIOMMU plus bypass_iommu doesn't change anything. But
I rarely use FreeBSD, so I may have missed something here.

I do not have a Windows VM right now and will report back once I finish
testing there.

> > ---
> >  docs/bypass-iommu.txt | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/docs/bypass-iommu.txt b/docs/bypass-iommu.txt
> > index e6677bddd3..8226f79104 100644
> > --- a/docs/bypass-iommu.txt
> > +++ b/docs/bypass-iommu.txt
> > @@ -68,6 +68,11 @@ devices might send malicious dma request to virtual machine if there is no
> >  iommu isolation. So it would be necessary to only bypass iommu for trusted
> >  device.
> >
> > +When Intel IOMMU is virtualized, if irq remapping is enabled, PCI and PCIe
> > +devices that bypassed vIOMMU will have their MSI/MSI-x functionalities disabled
>
> functionality

Will correct this, thanks.

> > +and fall back to IOAPIC. If this is not desired, disable irq remapping:
> > +qemu -device intel-iommu,intremap=off
> > +
> >  Implementation
> >  ==============
> >  The bypass iommu feature includes:
> > --
> > 2.45.0
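To make the DRHD device-scope behavior from the commit message concrete, here is a
rough toy model in Python. All names here are illustrative, not actual kernel
identifiers; it only sketches the lookup logic described above: with irq
remapping on, a device outside every DRHD's device scope ends up with no MSI
irq domain and falls back to IOAPIC.

```python
def msi_irq_domain(device, drhd_units, intremap_enabled):
    """Return the irq domain a device would use for MSI, or None.

    Toy model: with irq remapping enabled, a device (or one of its
    parent bridges) must appear in some DRHD's device scope; otherwise
    its irq domain is NULL and MSI enabling fails (the WARN above).
    """
    if not intremap_enabled:
        # Without remapping, every device uses the default MSI domain.
        return "default-msi-domain"
    for drhd in drhd_units:
        if device in drhd["device_scope"]:
            return drhd["ir_domain"]
    return None  # no DRHD covers this device -> fall back to IOAPIC

# A bypass_iommu device is outside every DRHD's device scope:
drhds = [{"device_scope": {"00:02.0"}, "ir_domain": "ir-domain-0"}]
assert msi_irq_domain("00:02.0", drhds, True) == "ir-domain-0"
assert msi_irq_domain("00:03.0", drhds, True) is None
assert msi_irq_domain("00:03.0", drhds, False) == "default-msi-domain"
```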
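For the record, the MSI-X state in the lspci output above can be read
mechanically: `Enable-` in the capability line means the capability is present
but not enabled, `Enable+` means enabled. A small hypothetical helper (the
function name is mine) that scrapes this flag from `lspci -v` output:

```python
import re

def msix_enabled(lspci_text):
    """Return True/False for the MSI-X Enable flag in lspci -v output,
    or None if no MSI-X capability line is present."""
    m = re.search(r"MSI-X: Enable([+-])", lspci_text)
    if not m:
        return None
    return m.group(1) == "+"

# The FreeBSD guest output above shows the capability present but off:
assert msix_enabled("Capabilities: [98] MSI-X: Enable- Count=9 Masked-") is False
```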