When Intel vIOMMU is used with irq remapping enabled, using bypass_iommu
causes the following two call stacks to be dumped during guest kernel boot,
and all PCI devices attached to the root bridge lose their MSI capabilities
and fall back to using IOAPIC:

[ 0.960262] ------------[ cut here ]------------
[ 0.961245] WARNING: CPU: 3 PID: 1 at drivers/pci/msi/msi.h:121 pci_msi_setup_msi_irqs+0x27/0x40
[ 0.963070] Modules linked in:
[ 0.963695] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 6.9.0-rc7-00056-g45db3ab70092 #1
[ 0.965225] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 0.967382] RIP: 0010:pci_msi_setup_msi_irqs+0x27/0x40
[ 0.968378] Code: 90 90 90 0f 1f 44 00 00 48 8b 87 30 03 00 00 89 f2 48 85 c0 74 14 f6 40 28 01 74 0e 48 81 c7 c0 00 00 00 31 f6 e9 29 42 9e ff <0f> 0b b8 ed ff ff ff c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00
[ 0.971756] RSP: 0000:ffffc90000017988 EFLAGS: 00010246
[ 0.972669] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 0.973901] RDX: 0000000000000005 RSI: 0000000000000005 RDI: ffff888100ee1000
[ 0.975391] RBP: 0000000000000005 R08: ffff888101f44d90 R09: 0000000000000228
[ 0.976629] R10: 0000000000000001 R11: 0000000000008d3f R12: ffffc90000017b80
[ 0.977864] R13: ffff888102312000 R14: ffff888100ee1000 R15: 0000000000000005
[ 0.979092] FS: 0000000000000000(0000) GS:ffff88817bd80000(0000) knlGS:0000000000000000
[ 0.980473] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.981464] CR2: 0000000000000000 CR3: 000000000302e001 CR4: 0000000000770ef0
[ 0.982687] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 0.983919] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 0.985143] PKRU: 55555554
[ 0.985625] Call Trace:
[ 0.986056] <TASK>
[ 0.986440] ? __warn+0x80/0x130
[ 0.987014] ? pci_msi_setup_msi_irqs+0x27/0x40
[ 0.987810] ? report_bug+0x18d/0x1c0
[ 0.988443] ? handle_bug+0x3a/0x70
[ 0.989026] ? exc_invalid_op+0x13/0x60
[ 0.989672] ? asm_exc_invalid_op+0x16/0x20
[ 0.990374] ? pci_msi_setup_msi_irqs+0x27/0x40
[ 0.991118] __pci_enable_msix_range+0x325/0x5b0
[ 0.991883] pci_alloc_irq_vectors_affinity+0xa9/0x110
[ 0.992698] vp_find_vqs_msix+0x1a8/0x4c0
[ 0.993332] vp_find_vqs+0x3a/0x1a0
[ 0.993893] vp_modern_find_vqs+0x17/0x70
[ 0.994531] init_vq+0x3ad/0x410
[ 0.995051] ? __pfx_default_calc_sets+0x10/0x10
[ 0.995789] virtblk_probe+0xeb/0xbc0
[ 0.996362] ? up_write+0x74/0x160
[ 0.996900] ? down_write+0x4d/0x80
[ 0.997450] virtio_dev_probe+0x1bc/0x270
[ 0.998059] really_probe+0xc1/0x390
[ 0.998626] ? __pfx___driver_attach+0x10/0x10
[ 0.999288] __driver_probe_device+0x78/0x150
[ 0.999924] driver_probe_device+0x1f/0x90
[ 1.000506] __driver_attach+0xce/0x1c0
[ 1.001073] bus_for_each_dev+0x70/0xc0
[ 1.001638] bus_add_driver+0x112/0x210
[ 1.002191] driver_register+0x55/0x100
[ 1.002760] virtio_blk_init+0x4c/0x90
[ 1.003332] ? __pfx_virtio_blk_init+0x10/0x10
[ 1.003974] do_one_initcall+0x41/0x240
[ 1.004510] ? kernel_init_freeable+0x240/0x4a0
[ 1.005142] kernel_init_freeable+0x321/0x4a0
[ 1.005749] ? __pfx_kernel_init+0x10/0x10
[ 1.006311] kernel_init+0x16/0x1c0
[ 1.006798] ret_from_fork+0x2d/0x50
[ 1.007303] ? __pfx_kernel_init+0x10/0x10
[ 1.007883] ret_from_fork_asm+0x1a/0x30
[ 1.008431] </TASK>
[ 1.008748] ---[ end trace 0000000000000000 ]---

The other call stack is dumped from pci_msi_teardown_msi_irqs().

In fact, every PCI device goes through these two paths. Only two call
stacks are dumped because both places use WARN_ON_ONCE().

What happens is this: when irq remapping is enabled, the kernel expects
every PCI device (or one of its parent bridges) to appear in the device
scope list of some DMA Remapping Hardware unit Definition (DRHD). If it
does not, the device's irq domain is NULL, and enabling MSI for that
device fails.

As far as I understand, only a virtualized system can have such a setup:
irq remapping enabled while not all PCI/PCIe devices appear in a DRHD's
device scope.

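The lookup described above can be sketched with a toy model. This is not
kernel code: the class, the function names, the device addresses and the
"drhd0" label are all hypothetical, and returning -ENODEV on failure is an
assumption made for the illustration.

```python
# Toy model only, NOT kernel code. All names here are hypothetical.
from dataclasses import dataclass
from typing import Optional

ENODEV = 19  # Linux errno "No such device"; used here as an assumed error code


@dataclass
class PciDev:
    name: str
    parent: Optional["PciDev"] = None  # upstream bridge, None on the root bus


def find_irq_domain(dev: PciDev, drhd_scopes: dict) -> Optional[str]:
    """Walk from the device up through its parent bridges and return a label
    for the first DRHD whose device scope lists one of them, else None."""
    node = dev
    while node is not None:
        for drhd, scope in drhd_scopes.items():
            if node.name in scope:
                return f"irq-domain-of-{drhd}"
        node = node.parent
    return None


def setup_msi(dev: PciDev, drhd_scopes: dict, intremap: bool = True) -> int:
    """Return 0 if MSI can be enabled in this model, a negative errno if not."""
    if not intremap:
        return 0  # in this model, no DRHD lookup happens without irq remapping
    if find_irq_domain(dev, drhd_scopes) is None:
        return -ENODEV  # no irq domain: MSI setup fails, device falls back
    return 0


# A device on a bus that bypasses the vIOMMU is in no DRHD device scope.
bypassed = PciDev("00:01.0")
# A device behind a bridge that is listed in a DRHD device scope.
covered = PciDev("10:00.0", parent=PciDev("00:03.0"))
scopes = {"drhd0": {"00:03.0"}}

print(setup_msi(bypassed, scopes))                  # -19
print(setup_msi(covered, scopes))                   # 0
print(setup_msi(bypassed, scopes, intremap=False))  # 0
```

In this model the bypassed device fails only while irq remapping is on,
which matches the intremap=off workaround added to the document.
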
Enhance the document by mentioning what could happen when bypass_iommu
is used.

For detailed qemu cmdline and guest kernel dmesg, please see:
https://lore.kernel.org/qemu-devel/20240510072519.GA39314@ziqianlu-desk2/

Reported-by: Juro Bystricky <juro.bystri...@intel.com>
Signed-off-by: Aaron Lu <aaron...@intel.com>
---
 docs/bypass-iommu.txt | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/docs/bypass-iommu.txt b/docs/bypass-iommu.txt
index e6677bddd3..8226f79104 100644
--- a/docs/bypass-iommu.txt
+++ b/docs/bypass-iommu.txt
@@ -68,6 +68,11 @@ devices might send malicious dma request to virtual machine if there is no
 iommu isolation. So it would be necessary to only bypass iommu for trusted
 device.
 
+When Intel IOMMU is virtualized, if irq remapping is enabled, PCI and PCIe
+devices that bypassed vIOMMU will have their MSI/MSI-x functionalities disabled
+and fall back to IOAPIC. If this is not desired, disable irq remapping:
+qemu -device intel-iommu,intremap=off
+
 Implementation
 ==============
 The bypass iommu feature includes:
-- 
2.45.0