On Wed, May 22, 2024 at 03:40:08PM +0800, Aaron Lu wrote:
> When Intel vIOMMU is used and irq remapping is enabled, using
> bypass_iommu will cause the following two callstacks to be dumped
> during kernel boot, and all PCI devices attached to the root bridge
> lose their MSI capabilities and fall back to using IOAPIC:
>
> [ 0.960262] ------------[ cut here ]------------
> [ 0.961245] WARNING: CPU: 3 PID: 1 at drivers/pci/msi/msi.h:121 pci_msi_setup_msi_irqs+0x27/0x40
> [ 0.963070] Modules linked in:
> [ 0.963695] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 6.9.0-rc7-00056-g45db3ab70092 #1
> [ 0.965225] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [ 0.967382] RIP: 0010:pci_msi_setup_msi_irqs+0x27/0x40
> [ 0.968378] Code: 90 90 90 0f 1f 44 00 00 48 8b 87 30 03 00 00 89 f2 48 85 c0 74 14 f6 40 28 01 74 0e 48 81 c7 c0 00 00 00 31 f6 e9 29 42 9e ff <0f> 0b b8 ed ff ff ff c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00
> [ 0.971756] RSP: 0000:ffffc90000017988 EFLAGS: 00010246
> [ 0.972669] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> [ 0.973901] RDX: 0000000000000005 RSI: 0000000000000005 RDI: ffff888100ee1000
> [ 0.975391] RBP: 0000000000000005 R08: ffff888101f44d90 R09: 0000000000000228
> [ 0.976629] R10: 0000000000000001 R11: 0000000000008d3f R12: ffffc90000017b80
> [ 0.977864] R13: ffff888102312000 R14: ffff888100ee1000 R15: 0000000000000005
> [ 0.979092] FS:  0000000000000000(0000) GS:ffff88817bd80000(0000) knlGS:0000000000000000
> [ 0.980473] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 0.981464] CR2: 0000000000000000 CR3: 000000000302e001 CR4: 0000000000770ef0
> [ 0.982687] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 0.983919] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 0.985143] PKRU: 55555554
> [ 0.985625] Call Trace:
> [ 0.986056]  <TASK>
> [ 0.986440]  ? __warn+0x80/0x130
> [ 0.987014]  ? pci_msi_setup_msi_irqs+0x27/0x40
> [ 0.987810]  ? report_bug+0x18d/0x1c0
> [ 0.988443]  ? handle_bug+0x3a/0x70
> [ 0.989026]  ? exc_invalid_op+0x13/0x60
> [ 0.989672]  ? asm_exc_invalid_op+0x16/0x20
> [ 0.990374]  ? pci_msi_setup_msi_irqs+0x27/0x40
> [ 0.991118]  __pci_enable_msix_range+0x325/0x5b0
> [ 0.991883]  pci_alloc_irq_vectors_affinity+0xa9/0x110
> [ 0.992698]  vp_find_vqs_msix+0x1a8/0x4c0
> [ 0.993332]  vp_find_vqs+0x3a/0x1a0
> [ 0.993893]  vp_modern_find_vqs+0x17/0x70
> [ 0.994531]  init_vq+0x3ad/0x410
> [ 0.995051]  ? __pfx_default_calc_sets+0x10/0x10
> [ 0.995789]  virtblk_probe+0xeb/0xbc0
> [ 0.996362]  ? up_write+0x74/0x160
> [ 0.996900]  ? down_write+0x4d/0x80
> [ 0.997450]  virtio_dev_probe+0x1bc/0x270
> [ 0.998059]  really_probe+0xc1/0x390
> [ 0.998626]  ? __pfx___driver_attach+0x10/0x10
> [ 0.999288]  __driver_probe_device+0x78/0x150
> [ 0.999924]  driver_probe_device+0x1f/0x90
> [ 1.000506]  __driver_attach+0xce/0x1c0
> [ 1.001073]  bus_for_each_dev+0x70/0xc0
> [ 1.001638]  bus_add_driver+0x112/0x210
> [ 1.002191]  driver_register+0x55/0x100
> [ 1.002760]  virtio_blk_init+0x4c/0x90
> [ 1.003332]  ? __pfx_virtio_blk_init+0x10/0x10
> [ 1.003974]  do_one_initcall+0x41/0x240
> [ 1.004510]  ? kernel_init_freeable+0x240/0x4a0
> [ 1.005142]  kernel_init_freeable+0x321/0x4a0
> [ 1.005749]  ? __pfx_kernel_init+0x10/0x10
> [ 1.006311]  kernel_init+0x16/0x1c0
> [ 1.006798]  ret_from_fork+0x2d/0x50
> [ 1.007303]  ? __pfx_kernel_init+0x10/0x10
> [ 1.007883]  ret_from_fork_asm+0x1a/0x30
> [ 1.008431]  </TASK>
> [ 1.008748] ---[ end trace 0000000000000000 ]---
>
> Another callstack happens at pci_msi_teardown_msi_irqs().
>
> Actually every PCI device will trigger these two paths. There are only
> two callstack dumps because the two places use WARN_ON_ONCE().
>
> What happened is: when irq remapping is enabled, the kernel expects
> every PCI device (or one of its parent bridges) to appear in the device
> scope list of some DMA Remapping Hardware unit Definition (DRHD). If a
> device does not, its irq domain becomes NULL, and enabling MSI for that
> device fails.
>
> Per my understanding, only a virtualized system can have such a setup:
> irq remapping enabled while not all PCI/PCIe devices appear in a DRHD's
> device scope.
>
> Enhance the document by mentioning what could happen when bypass_iommu
> is used.
>
> For the detailed qemu cmdline and guest kernel dmesg, please see:
> https://lore.kernel.org/qemu-devel/20240510072519.GA39314@ziqianlu-desk2/
>
> Reported-by: Juro Bystricky <juro.bystri...@intel.com>
> Signed-off-by: Aaron Lu <aaron...@intel.com>

Is this issue specific to Linux?

> ---
>  docs/bypass-iommu.txt | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/docs/bypass-iommu.txt b/docs/bypass-iommu.txt
> index e6677bddd3..8226f79104 100644
> --- a/docs/bypass-iommu.txt
> +++ b/docs/bypass-iommu.txt
> @@ -68,6 +68,11 @@ devices might send malicious dma request to virtual machine if there is no
>  iommu isolation. So it would be necessary to only bypass iommu for trusted
>  device.
>
> +When Intel IOMMU is virtualized, if irq remapping is enabled, PCI and PCIe
> +devices that bypassed vIOMMU will have their MSI/MSI-x functionalities disabled

functionality

> +and fall back to IOAPIC. If this is not desired, disable irq remapping:
> +qemu -device intel-iommu,intremap=off
> +
>  Implementation
>  ==============
>  The bypass iommu feature includes:
> --
> 2.45.0
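For readers following along, the failure mode described in the commit message can be sketched in a few lines. This is not the actual kernel source, just a minimal self-contained illustration (struct names and return values are stand-ins): with irq remapping enabled, a device not covered by any DRHD scope ends up with a NULL MSI irq domain, so MSI vector allocation bails out and the device is left with IOAPIC only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for kernel structures -- illustrative only. */
struct irq_domain { const char *name; };

struct pci_dev {
    const char *name;
    struct irq_domain *msi_domain; /* NULL when no DRHD covers the device */
};

/* Sketch of the condition behind the WARN_ON_ONCE() in the report:
 * a device whose irq domain resolved to NULL cannot have MSI vectors
 * allocated, so the caller must fall back to IOAPIC. */
static int setup_msi_irqs(struct pci_dev *dev, int nvec)
{
    if (dev->msi_domain == NULL) {
        fprintf(stderr, "%s: no MSI irq domain, falling back to IOAPIC\n",
                dev->name);
        return -1; /* the kernel returns an errno here */
    }
    printf("%s: allocated %d MSI vectors via domain %s\n",
           dev->name, nvec, dev->msi_domain->name);
    return 0;
}

int main(void)
{
    struct irq_domain ir = { "INTEL-IR-MSI" };        /* covered by a DRHD */
    struct pci_dev covered  = { "virtio-blk (in DRHD scope)", &ir };
    struct pci_dev bypassed = { "virtio-blk (bypass_iommu)", NULL };

    assert(setup_msi_irqs(&covered, 5) == 0);   /* MSI works */
    assert(setup_msi_irqs(&bypassed, 5) != 0);  /* MSI setup fails */
    return 0;
}
```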
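For reference, a sketch of the two configurations involved. The device list here is illustrative, not the reporter's exact cmdline (that is in the lore link above); `pxb-pcie` with `bypass_iommu=true` is one way to put devices outside the vIOMMU's scope.

```shell
# Guest with vIOMMU and irq remapping on, plus a bridge that bypasses the
# IOMMU: devices behind that bridge lose MSI/MSI-X and fall back to IOAPIC.
qemu-system-x86_64 -machine q35 \
    -device intel-iommu,intremap=on \
    -device pxb-pcie,bus_nr=0x10,id=pcie.1,bus=pcie.0,bypass_iommu=true \
    ...

# Workaround suggested by the patch: disable irq remapping.
qemu-system-x86_64 -machine q35 \
    -device intel-iommu,intremap=off \
    -device pxb-pcie,bus_nr=0x10,id=pcie.1,bus=pcie.0,bypass_iommu=true \
    ...
```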