On Thu, 25 May 2017 10:53:29 +0000 Hu Zhifeng <zhifeng...@hotmail.com> wrote:
> Dear all,
>
> I am running a fresh Fedora 23 and want to use kvm/qemu to run a Windows VM
> with GPU passthrough.
>
> My setup is as follows:
> Host OS: Fedora 23 (Workstation x86_64)
> Kernel: 4.2.3-300.fc23.x86_64
> QEMU version: qemu-2.4.0.1-1.fc23
> Guest VM: Windows 7
> CPU: Intel i7-6700K
> Motherboard: Gigabyte B150-HD3
> IGD: Intel® HD Graphics 530 (used by the host)
> Graphics Card: GT710 (used by the VM)
>
> First, enable IOMMU by appending the `intel_iommu=on` parameter to GRUB.
> Next, prevent the kernel modules i915, nouveau and snd_hda_intel from being
> loaded for both the initramfs and the system.
> Then, load vfio-pci with ids (modprobe vfio-pci ids=10de:128b,10de:0e0f)
> Last, run qemu like this:
> qemu-system-x86_64 -enable-kvm -m 4G -cpu host,kvm=off
>   -smp 4,sockets=1,cores=2,threads=2 -hda ~/win7.img
>   -usbdevice host:093a:2510 -usbdevice host:0c45:7603
>   -device vfio-pci,host=01:00.0,x-vga=on -device vfio-pci,host=01:00.1
>   -vga none

You really want to avoid x-vga=on, especially with IGD host graphics.  I'm
also not sure why you're preventing i915 from loading if you intend to use
IGD for the host graphics.

> Everything looks good and the dedicated GPU is detected by the guest VM
> (N.B. the GPU driver `378.92-desktop-win8-win7-64bit-international-whql.exe`
> was ready), but the guest VM is running very slow, and I observed a kernel
> panic generated by vfio_pci.
>
> Here's the log from dmesg:
> [  737.317946] vgaarb: device changed decodes: PCI:0000:01:00.0,olddecodes=io+mem,decodes=io+mem:owns=none
> [  737.356996] vgaarb: device changed decodes: PCI:0000:01:00.0,olddecodes=io+mem,decodes=io+mem:owns=none
> [  737.367606] vfio_pci: add [10de:128b[ffff:ffff]] class 0x000000/00000000
> [  737.378437] vfio_pci: add [10de:0e0f[ffff:ffff]] class 0x000000/00000000
> [  738.233680] vfio-pci 0000:01:00.0: enabling device (0000 -> 0003)
> [  739.755715] kvm: zapping shadow pages for mmio generation wraparound
> [  739.874265] irq 16: nobody cared (try booting with the "irqpoll" option)
> [  739.874269] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.2.3-300.fc23.x86_64 #1
> [  739.874270] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./B150-HD3-CF, BIOS F5 03/11/2016
> [  739.874271] 0000000000000000 e5300c14e6af3df1 ffff880470c03e28 ffffffff81771fca
> [  739.874272] 0000000000000000 ffff88045b2844a4 ffff880470c03e58 ffffffff810f88a5
> [  739.874273] ffff880081f42e50 ffff88045b284400 0000000000000000 0000000000000010
> [  739.874275] Call Trace:
> [  739.874276]  <IRQ>  [<ffffffff81771fca>] dump_stack+0x45/0x57
> [  739.874281]  [<ffffffff810f88a5>] __report_bad_irq+0x35/0xd0
> [  739.874282]  [<ffffffff810f8c44>] note_interrupt+0x244/0x290
> [  739.874284]  [<ffffffff810f607c>] handle_irq_event_percpu+0x11c/0x180
> [  739.874285]  [<ffffffff810f6110>] handle_irq_event+0x30/0x60
> [  739.874286]  [<ffffffff810f91f4>] handle_fasteoi_irq+0x84/0x150
> [  739.874287]  [<ffffffff81016e42>] handle_irq+0x72/0x120
> [  739.874289]  [<ffffffff810bd66a>] ? atomic_notifier_call_chain+0x1a/0x20
> [  739.874291]  [<ffffffff8177b5df>] do_IRQ+0x4f/0xe0
> [  739.874292]  [<ffffffff817794eb>] common_interrupt+0x6b/0x6b
> [  739.874292]  <EOI>  [<ffffffff81108a4f>] ? hrtimer_start_range_ns+0x1bf/0x3b0
> [  739.874296]  [<ffffffff816160c0>] ? cpuidle_enter_state+0x130/0x270
> [  739.874297]  [<ffffffff8161609b>] ? cpuidle_enter_state+0x10b/0x270
> [  739.874298]  [<ffffffff81616237>] cpuidle_enter+0x17/0x20
> [  739.874300]  [<ffffffff810dfcc2>] call_cpuidle+0x32/0x60
> [  739.874301]  [<ffffffff81616213>] ? cpuidle_select+0x13/0x20
> [  739.874302]  [<ffffffff810dff58>] cpu_startup_entry+0x268/0x320
> [  739.874304]  [<ffffffff8176870c>] rest_init+0x7c/0x80
> [  739.874305]  [<ffffffff81d5702d>] start_kernel+0x49d/0x4be
> [  739.874307]  [<ffffffff81d56120>] ? early_idt_handler_array+0x120/0x120
> [  739.874308]  [<ffffffff81d56339>] x86_64_start_reservations+0x2a/0x2c
> [  739.874309]  [<ffffffff81d56485>] x86_64_start_kernel+0x14a/0x16d
> [  739.874309] handlers:
> [  739.874313] [<ffffffffa05172d0>] vfio_intx_handler [vfio_pci]
> [  739.874313] Disabling IRQ #16

What's happening here is that the spurious interrupt handling code is noting
that there are too many unhandled interrupts on this IRQ and disabling it,
which switches to a polling mode behavior, and yes, performance will be
terrible.  My write-up on making Windows use MSI covers some of the
background for this:

http://vfio.blogspot.com/2014/09/vfio-interrupts-and-how-to-coax-windows.html

In summary, we rely on the device to tell us when an interrupt is pending in
order to claim the interrupt; if it doesn't, then we assume it's another
device sharing the interrupt and let it go.  If it's actually our device
interrupting without indicating so, or there's another device shouting on the
same interrupt line, you can hit this problem.
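For what it's worth, you can check from the host whether the assigned devices
are actually using MSI or legacy INTx while the guest is running, for example
(device addresses taken from your QEMU command line above; the exact lspci
output varies by device):

  # Run as root so lspci can read the capability list
  lspci -vvv -s 01:00.0 | grep -E 'MSI:|IRQ'
  lspci -vvv -s 01:00.1 | grep -E 'MSI:|IRQ'

"MSI: Enable+" means the guest driver has switched the device to MSI;
"Enable-" plus a "pin ... routed to IRQ 16" style line means legacy INTx,
which is what feeds the "nobody cared" behavior above.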
> What I've tried so far:
> 1. Different graphics card (GTX750Ti), with same results

My question would be whether the problem interrupt is the GPU or the audio.
You could remove the audio assignment and see if it still occurs.  If it is
the audio device, then follow the guide above, as GeForce audio interrupts
are only marginally functional anyway.

> 2. Different host OS (Fedora 24: Kernel 4.5.5-300.fc24.x86_64 +
> qemu-2.6.2-8.fc24), without any issues

That's interesting, I don't know what would be different, but also why are
you running the original FC23 kernel when I know there are FC23 updates that
bring it up to a 4.8 kernel?  If you don't keep your software up to date,
bugs are to be expected.

> 3. Load vfio-pci with `nointxmask=1`, without any issues

With this option we get an exclusive interrupt for the device and then we
handle each interrupt under the assumption that it's for our device.  If
there's really something else pulling this interrupt, that might mean we're
injecting additional (spurious) interrupts into the guest.  Generally this is
ok so long as we don't hit a rate sufficient to trigger a similar spurious
interrupt shutdown in the guest.

> 4. Remove `-hda ~/win7.img` from the QEMU command (seabios only), still get
> the same crash

So you don't even have real guest drivers loaded...  Look in /proc/interrupts
with the new kernel; are there multiple devices on the interrupt line with
that kernel?

> So I have some questions now:
> 1. Is this a known issue?  What is the root cause?

Not a known issue, root cause covered above, certainly something that may be
fixed in updated kernels, or maybe updated kernels just shut down, or have a
driver for, the device sharing the interrupt.

> 2. Why does Fedora 24 not have this issue?  Related to the kernel, qemu or
> other components?

You could try updating one or the other.

> 3. Is `nointxmask=1` the right way to avoid the crash?

This is a valid workaround, but it means that vfio-pci will always require an
exclusive INTx interrupt for any assigned device, which often makes it
difficult to achieve a working configuration.  As above, if the additional
interrupts are not generated by the GPU/audio, then we're potentially
injecting spurious interrupts into the guest.
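In case it helps, a quick way to see who else is on that interrupt line on
the host, and to make the nointxmask workaround persistent, would be
something like the following (the modprobe.d file name is arbitrary):

  # List every handler registered on IRQ 16 on the host; more than one
  # name at the end of the line means the IRQ is shared
  grep -E '^ *16:' /proc/interrupts

  # Make the workaround persistent (as root); vfio-pci will then always
  # request an exclusive INTx interrupt for assigned devices
  echo 'options vfio-pci nointxmask=1' > /etc/modprobe.d/vfio-nointxmask.conf
  dracut -f   # only needed if vfio-pci is pulled into the initramfs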
Thanks,
Alex

_______________________________________________
vfio-users mailing list
vfio-users@redhat.com
https://www.redhat.com/mailman/listinfo/vfio-users