On Wed, May 22, 2024 at 03:40:08PM +0800, Aaron Lu wrote:
> When Intel vIOMMU is used and irq remapping is enabled, using
> bypass_iommu causes the following two callstacks to be dumped during
> kernel boot, and all PCI devices attached to the root bridge lose their
> MSI capabilities and fall back to using IOAPIC:
> 
> [    0.960262] ------------[ cut here ]------------
> [    0.961245] WARNING: CPU: 3 PID: 1 at drivers/pci/msi/msi.h:121 pci_msi_setup_msi_irqs+0x27/0x40
> [    0.963070] Modules linked in:
> [    0.963695] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 6.9.0-rc7-00056-g45db3ab70092 #1
> [    0.965225] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [    0.967382] RIP: 0010:pci_msi_setup_msi_irqs+0x27/0x40
> [    0.968378] Code: 90 90 90 0f 1f 44 00 00 48 8b 87 30 03 00 00 89 f2 48 85 c0 74 14 f6 40 28 01 74 0e 48 81 c7 c0 00 00 00 31 f6 e9 29 42 9e ff <0f> 0b b8 ed ff ff ff c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00
> [    0.971756] RSP: 0000:ffffc90000017988 EFLAGS: 00010246
> [    0.972669] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> [    0.973901] RDX: 0000000000000005 RSI: 0000000000000005 RDI: ffff888100ee1000
> [    0.975391] RBP: 0000000000000005 R08: ffff888101f44d90 R09: 0000000000000228
> [    0.976629] R10: 0000000000000001 R11: 0000000000008d3f R12: ffffc90000017b80
> [    0.977864] R13: ffff888102312000 R14: ffff888100ee1000 R15: 0000000000000005
> [    0.979092] FS:  0000000000000000(0000) GS:ffff88817bd80000(0000) knlGS:0000000000000000
> [    0.980473] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    0.981464] CR2: 0000000000000000 CR3: 000000000302e001 CR4: 0000000000770ef0
> [    0.982687] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    0.983919] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [    0.985143] PKRU: 55555554
> [    0.985625] Call Trace:
> [    0.986056]  <TASK>
> [    0.986440]  ? __warn+0x80/0x130
> [    0.987014]  ? pci_msi_setup_msi_irqs+0x27/0x40
> [    0.987810]  ? report_bug+0x18d/0x1c0
> [    0.988443]  ? handle_bug+0x3a/0x70
> [    0.989026]  ? exc_invalid_op+0x13/0x60
> [    0.989672]  ? asm_exc_invalid_op+0x16/0x20
> [    0.990374]  ? pci_msi_setup_msi_irqs+0x27/0x40
> [    0.991118]  __pci_enable_msix_range+0x325/0x5b0
> [    0.991883]  pci_alloc_irq_vectors_affinity+0xa9/0x110
> [    0.992698]  vp_find_vqs_msix+0x1a8/0x4c0
> [    0.993332]  vp_find_vqs+0x3a/0x1a0
> [    0.993893]  vp_modern_find_vqs+0x17/0x70
> [    0.994531]  init_vq+0x3ad/0x410
> [    0.995051]  ? __pfx_default_calc_sets+0x10/0x10
> [    0.995789]  virtblk_probe+0xeb/0xbc0
> [    0.996362]  ? up_write+0x74/0x160
> [    0.996900]  ? down_write+0x4d/0x80
> [    0.997450]  virtio_dev_probe+0x1bc/0x270
> [    0.998059]  really_probe+0xc1/0x390
> [    0.998626]  ? __pfx___driver_attach+0x10/0x10
> [    0.999288]  __driver_probe_device+0x78/0x150
> [    0.999924]  driver_probe_device+0x1f/0x90
> [    1.000506]  __driver_attach+0xce/0x1c0
> [    1.001073]  bus_for_each_dev+0x70/0xc0
> [    1.001638]  bus_add_driver+0x112/0x210
> [    1.002191]  driver_register+0x55/0x100
> [    1.002760]  virtio_blk_init+0x4c/0x90
> [    1.003332]  ? __pfx_virtio_blk_init+0x10/0x10
> [    1.003974]  do_one_initcall+0x41/0x240
> [    1.004510]  ? kernel_init_freeable+0x240/0x4a0
> [    1.005142]  kernel_init_freeable+0x321/0x4a0
> [    1.005749]  ? __pfx_kernel_init+0x10/0x10
> [    1.006311]  kernel_init+0x16/0x1c0
> [    1.006798]  ret_from_fork+0x2d/0x50
> [    1.007303]  ? __pfx_kernel_init+0x10/0x10
> [    1.007883]  ret_from_fork_asm+0x1a/0x30
> [    1.008431]  </TASK>
> [    1.008748] ---[ end trace 0000000000000000 ]---
> 
> A similar callstack is dumped at pci_msi_teardown_msi_irqs().
> 
> In fact, every PCI device triggers these two paths; only two callstacks
> are dumped because both places use WARN_ON_ONCE().
> 
> What happened is: when irq remapping is enabled, the kernel expects
> every PCI device (or one of its parent bridges) to appear in some DMA
> Remapping Hardware unit Definition (DRHD)'s device scope list. If a
> device does not, its irq domain becomes NULL, which makes enabling MSI
> on that device fail.
> 
> Per my understanding, only a virtualized system can have such a setup:
> irq remapping enabled while not all PCI/PCIe devices appear in a DRHD's
> device scope.
> 
> Enhance the documentation by mentioning what can happen when
> bypass_iommu is used.
> 
> For detailed qemu cmdline and guest kernel dmesg, please see:
> https://lore.kernel.org/qemu-devel/20240510072519.GA39314@ziqianlu-desk2/
> 
> Reported-by: Juro Bystricky <juro.bystri...@intel.com>
> Signed-off-by: Aaron Lu <aaron...@intel.com>

Is this issue specific to Linux?
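For reference, a sketch of a guest invocation that would hit this (the disk image, memory size and virtio-blk device are placeholders taken from the reported virtblk_probe trace; only the intel-iommu and bypass options matter here):

```shell
# Guest with vIOMMU + interrupt remapping, but with the root bus
# bypassing the IOMMU (default_bus_bypass_iommu=on). Devices on the
# root bus then sit outside every DRHD device scope, so the guest
# kernel leaves their irq domain NULL and MSI/MSI-X setup fails.
# intremap=on on q35 requires kernel-irqchip=split.
qemu-system-x86_64 \
    -machine q35,accel=kvm,kernel-irqchip=split,default_bus_bypass_iommu=on \
    -device intel-iommu,intremap=on \
    -drive file=disk.img,if=none,id=d0 \
    -device virtio-blk-pci,drive=d0 \
    -m 4G

# Workaround described by the patch: keep the bypass but disable
# interrupt remapping:
#   -device intel-iommu,intremap=off
```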

> ---
>  docs/bypass-iommu.txt | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/docs/bypass-iommu.txt b/docs/bypass-iommu.txt
> index e6677bddd3..8226f79104 100644
> --- a/docs/bypass-iommu.txt
> +++ b/docs/bypass-iommu.txt
> @@ -68,6 +68,11 @@ devices might send malicious dma request to virtual machine if there is no
>  iommu isolation. So it would be necessary to only bypass iommu for trusted
>  device.
>  
> +When Intel IOMMU is virtualized, if irq remapping is enabled, PCI and PCIe
> +devices that bypassed vIOMMU will have their MSI/MSI-x functionalities disabled

functionality

> +and fall back to IOAPIC. If this is not desired, disable irq remapping:
> +qemu -device intel-iommu,intremap=off
> +
>  Implementation
>  ==============
>  The bypass iommu feature includes:
> -- 
> 2.45.0
