On Wed, Jul 5, 2017 at 10:23 PM, Thiago Ramon <thiagora...@gmail.com> wrote:
>
> Here, dropped the raw message in pastebin: https://pastebin.com/hfJ6ryJg
>
> That particular run was trying to pass the 980 Ti, which is the boot
> device, and which probably had something else prodding at it (I'll give it
> a try again and check what else was attaching to it). I've mostly focused
> on passing the 1060 though, which doesn't get touched by anything but
> vfio-pci, and also doesn't show any mmap issues, here's the last QEMU run
> with SeaBIOS:
>
> https://pastebin.com/DEPpewCH
>
> And the last one from OVMF:
>
> https://pastebin.com/L7gkrm36
>
> On the kernel log, I only get the vfio_bar_restore messages. One
> interesting and consistent pattern is that SeaBIOS always generate 2 pairs
> of warnings (one for GPU, one audio), while OVMF generates quite a bit
> (dozen+, don't have a log handy). Probably not relevant, as apparently the
> failure happens before the first message anyway.
>
> Another detail that may be relevant: Whenever I try a passthrough (and
> fail), the kernel fails to soft restart. It gets to the last stage where it
> would do a soft reset but the console just sits there. Could this just be
> vfio_pci trying to do something with the unresponsive card, or something
> else that may be a clue to what's going on?
>

Yep, here's what I suspected about the D3 warning:

> PCI state after passthrough attempt:
> 29:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX 980 Ti] [10de:17c8] (rev ff) (prog-if ff)
>         !!! Unknown header type 7f
>         Kernel driver in use: vfio-pci
>         Kernel modules: nouveau, nvidia_drm, nvidia
>
> 29:00.1 Audio device [0403]: NVIDIA Corporation GM200 High Definition Audio [10de:0fb0] (rev ff) (prog-if ff)
>         !!! Unknown header type 7f
>         Kernel driver in use: vfio-pci
>         Kernel modules: snd_hda_intel

The card isn't actually stuck in D3; it has basically disappeared from
the bus, and all reads from config space are returning -1, which is
indistinguishable from the D3 power state for the bits that tell us the
power state. This is probably the result of doing a bus reset, but
that's also our only way of putting the device back into a known state
before starting it in the VM.

You might try to reproduce this result manually with setpci. We do a bus
reset through the bridge upstream of the device; lspci -t is handy for
finding it, since it gives a tree view of the PCI topology. As an
example:

https://pastebin.com/c3URT6vx

Bus numbers are shown in brackets, so if I want the parent bridge of
device 01:00.0, I look to the left of [01]--00.0 to find 01.0. This is
attached to the root bus at [0000:00], so the full address of the parent
bridge is 0000:00:01.0. We can access the bridge control register using:

  # setpci -s 0000:00:01.0 BRIDGE_CONTROL

The secondary bus reset bit is 0x40. We want to set this bit:

  # setpci -s 0000:00:01.0 BRIDGE_CONTROL=40:40

Then clear it:

  # setpci -s 0000:00:01.0 BRIDGE_CONTROL=00:40

Then run lspci on the bus to see if the device is still present. In
your case it would be bus 29, so you'd run:

  # lspci -vvv -s 0000:29:

Do you get output like above with the 'Unknown header type 7f', or a
complete listing of the device?
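
For reference, the bit-level details above can be sketched in a few
lines of Python. This is only an illustration of the arithmetic, not
what setpci, lspci, or vfio-pci actually run; the function names and the
starting register value 0x0003 are made up for the example. It assumes
the usual layout where the power state is the low two bits of the power
management control/status register and the header type is the low seven
bits of its register.

```python
SECONDARY_BUS_RESET = 0x40  # bit 6 of the bridge control register


def masked_write(old, value, mask):
    """Model setpci's REG=value:mask form: only bits set in mask change."""
    return (old & ~mask) | (value & mask)


def power_state(pmcsr):
    """Decode the power state field (bits 1:0) of the PM control/status register."""
    return ("D0", "D1", "D2", "D3hot")[pmcsr & 0x3]


def header_type(raw):
    """Header layout is the low 7 bits; bit 7 is the multi-function flag."""
    return raw & 0x7F


# BRIDGE_CONTROL=40:40 sets the reset bit, BRIDGE_CONTROL=00:40 clears it,
# and the other bits of the (hypothetical) starting value are left alone.
bridge_control = 0x0003
asserted = masked_write(bridge_control, 0x40, SECONDARY_BUS_RESET)
released = masked_write(asserted, 0x00, SECONDARY_BUS_RESET)
print(hex(asserted), hex(released))

# A device that has dropped off the bus returns all ones for every read,
# so the power state field decodes as D3 and the header type as 7f.
print(power_state(0xFFFF), hex(header_type(0xFF)))
```

This is why the D3 warning and the 'Unknown header type 7f' line are two
views of the same thing: every bit in config space is reading back as 1.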

Be sure to reboot the system after running this test, regardless of the
result, so that the device is re-initialized. And clearly nothing should
be using the device while you do this. If the graphics card doesn't
recover from a bus reset, then something about this system setup is not
compatible with this use case.

Thanks,
Alex

_______________________________________________
vfio-users mailing list
vfio-users@redhat.com
https://www.redhat.com/mailman/listinfo/vfio-users