Alex, (crossing fingers this goes into the correct thread).
I upgraded this machine to 4.4.0-42-generic. Immediately after the kernel upgrade I spawned a single VM with 1 GPU, and it worked: the GPU attached properly and showed up correctly when I ran lspci in the VM. I then deleted that VM and started one with 4x GPUs, and it exhibited the same issue as before. Three of the four GPUs attached properly, so it appears that upgrading the kernel did not resolve the problem.

If you don't mind providing instructions for resetting the bus (what you were talking about yesterday) so I can narrow this down further, that would be appreciated. Any other suggestions would be greatly appreciated as well.

Here are the logs from the 4 GPU attachment that failed.

On the host:

/etc/var/log/libvirt/qemu/instance-00000185.log

This shows the /usr/bin/kvm command attaching the following devices:

-device vfio-pci,host=0f:00.0,id=hostdev0,bus=pci.0,addr=0x5
-device vfio-pci,host=10:00.0,id=hostdev1,bus=pci.0,addr=0x6
-device vfio-pci,host=0e:00.0,id=hostdev2,bus=pci.0,addr=0x7
-device vfio-pci,host=0d:00.0,id=hostdev3,bus=pci.0,addr=0x8

lspci -vnnn -d 10de:17c2 (on the host; I omitted the other 4 GPUs)

0d:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
        subsystem: NVIDIA Corporation Device [10de:1132]
        Flags: fast devsel, IRQ 28
        Memory at b9000000 (32-bit, non-prefetchable) [size=16M]
        Memory at 38ff20000000 (64-bit, prefetchable) [size=256M]
        Memory at 38ff30000000 (64-bit, prefetchable) [size=32M]
        I/O ports at 3000 [size=128]
        Expansion ROM at ba000000 [disabled] [size=512k]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable-1 Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Legacy Endpoint, MSI 00
        Capabilities: [100] Express Legacy Endpoint, MSI 00
        Capabilities: [250] Latency Tolerance Reporting
        Capabilities: [258] L1 PM Substates
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900] #19
        Kernel driver in use: vfio-pci

0e:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
        subsystem: NVIDIA Corporation Device [10de:1132]
        Flags: fast devsel, IRQ 28
        Memory at b9000000 (32-bit, non-prefetchable) [size=16M]
        Memory at 38ff20000000 (64-bit, prefetchable) [size=256M]
        Memory at 38ff30000000 (64-bit, prefetchable) [size=32M]
        I/O ports at 3000 [size=128]
        Expansion ROM at ba000000 [disabled] [size=512k]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable-1 Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Legacy Endpoint, MSI 00
        Capabilities: [100] Express Legacy Endpoint, MSI 00
        Capabilities: [250] Latency Tolerance Reporting
        Capabilities: [258] L1 PM Substates
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900] #19
        Kernel driver in use: vfio-pci

0f:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev ff) (prog-if ff)
        !!! Unknown header type 7f
        Kernel driver in use: vfio-pci

10:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
        subsystem: NVIDIA Corporation Device [10de:1132]
        Flags: fast devsel, IRQ 28
        Memory at b9000000 (32-bit, non-prefetchable) [size=16M]
        Memory at 38ff20000000 (64-bit, prefetchable) [size=256M]
        Memory at 38ff30000000 (64-bit, prefetchable) [size=32M]
        I/O ports at 3000 [size=128]
        Expansion ROM at ba000000 [disabled] [size=512k]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable-1 Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Legacy Endpoint, MSI 00
        Capabilities: [100] Express Legacy Endpoint, MSI 00
        Capabilities: [250] Latency Tolerance Reporting
        Capabilities: [258] L1 PM Substates
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900] #19
        Kernel driver in use: vfio-pci

On the VM guest:

lspci

00:06.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
00:07.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
00:08.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)

dmesg

[ 0.787786] pci 0000:00:05.0: [10de:17c2] type 7f class 0xffffff
[ 0.788970] pci 0000:00:06.0: [10de:17c2] type 00 class 0x030000
[ 0.855192] pci 0000:00:07.0: [10de:17c2] type 00 class 0x030000
[ 0.925003] pci 0000:00:08.0: [10de:17c2] type 00 class 0x030000

On Mon, Oct 17, 2016 at 11:10 PM, Kevin Vasko <kva...@gmail.com> wrote:

> Thanks. I'm an idiot. I just replied to the email directly after the
> subscription and wasn't paying attention. Thank you for correcting it.
>
> I was originally running 3.13.0-86-generic upgraded to the 3.19 version to
> try before I posted this, but got the same results.
> I'll try a newer version of the kernel and see what happens.
>
> Sorry to be dense but what do you mean by "retrain properly"? I assume you
> mean that once it fails to reset it just never recovers?
>
> We have 2 other machines that I've never seen this problem with so
> what you are saying makes sense. This system does have a slightly more
> specialized PCI bus to be able to stick 8 cards on a single bus (at least
> that is my understanding), so at this point, either I'm hitting a bug that
> is fixed in the kernel, or this PCI bus is not doing something that
> vfio-pci is expecting (would be my speculation).
>
> I'll report back my findings tomorrow.
>
> Thanks for the help.
>
> -Kevin
>
> On Mon, Oct 17, 2016 at 5:53 PM, Alex Williamson <
> alex.william...@redhat.com> wrote:
>
>> (generally a good idea to have a useful subject line)
>>
>> On Mon, 17 Oct 2016 16:26:15 -0500
>> Kevin Vasko <kva...@gmail.com> wrote:
>> >
>> > Any suggestions on debugging a !!! Unknown header type 7f?
>> >
>>
>> This usually means that the device didn't come back from bus reset and
>> re-reading the PCI config space where the device was just gives a -1
>> response. lspci tries to interpret that bogus data and gives results
>> like you see. You might try a newer kernel, we've probably fixed some
>> things in the bus reset path since v3.19. It looks like you continue
>> to see the bogus data once it gets into this state, so it's probably
>> not a "simple" device coming out of reset too slowly problem. Possibly
>> the PCIe link doesn't retrain properly sometimes after a bus reset. If
>> a new kernel doesn't help, I could give you instructions for performing
>> a bus reset with setpci and you could test how reliably you can reset
>> the device and read config space after. Thanks,
>>
>> Alex
>>
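
As a rough illustration of the explanation quoted above: when config space reads come back as all ones (-1), the header type byte is 0xff, and masking off the multi-function bit leaves the 7f that lspci and dmesg report, along with rev ff and class 0xffffff. The Python sketch below decodes those header bytes. The function names are hypothetical, the sysfs `config` file is assumed to be available (Linux only), and the demo uses synthetic bytes rather than data from the failing host.

```python
from pathlib import Path


def read_config(bdf: str, length: int = 64) -> bytes:
    """Read the start of a device's config space from sysfs (Linux only).

    bdf is a full address such as "0000:0f:00.0".
    """
    with Path(f"/sys/bus/pci/devices/{bdf}/config").open("rb") as f:
        return f.read(length)


def summarize(cfg: bytes) -> dict:
    """Decode a few standard header fields from raw config space bytes."""
    if len(cfg) < 0x10:
        raise ValueError("need at least 16 bytes of config space")
    return {
        "revision": cfg[0x08],
        # 24-bit class code: prog-if, subclass, base class (little-endian)
        "class_code": int.from_bytes(cfg[0x09:0x0C], "little"),
        # bit 7 of the header type byte is the multi-function flag
        "header_type": cfg[0x0E] & 0x7F,
        # all ones in this range is what a failed read looks like
        "looks_dead": all(b == 0xFF for b in cfg[0x08:0x10]),
    }


if __name__ == "__main__":
    # Synthetic example: a healthy VGA device and an all-ones response.
    healthy = bytearray(64)
    healthy[0x00:0x04] = bytes([0xDE, 0x10, 0xC2, 0x17])  # 10de:17c2
    healthy[0x08] = 0xA1                                   # rev a1
    healthy[0x0B] = 0x03                                   # base class: display
    for name, cfg in (("healthy", bytes(healthy)), ("dead", b"\xff" * 64)):
        s = summarize(cfg)
        print(f"{name}: rev {s['revision']:02x} type {s['header_type']:02x} "
              f"class {s['class_code']:#08x} looks_dead={s['looks_dead']}")
```

On a real host, something like `summarize(read_config("0000:0f:00.0"))` could be run before and after a reset to see whether the device comes back, assuming the sysfs read returns the same all-ones data that lspci is interpreting.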
_______________________________________________
vfio-users mailing list
vfio-users@redhat.com
https://www.redhat.com/mailman/listinfo/vfio-users