Issues with pcipassthrough reliably working with Titan X GPUs

We have a machine with 8 Titan X GPUs in it (Cirrascale GX8). We are trying to use KVM (OpenStack does the provisioning) and PCI passthrough to launch VM instances on this system so that multiple users can use the GPUs, but we are having issues doing so. I would like some help or tips, if possible, on how to debug this.
The problem is that, under certain circumstances, attaching a GPU to the VM fails, seemingly at random. In the VM I see "unknown header type 7f, ignoring device" with the ID of the device it tried to attach.

We are using Ubuntu with kernel 3.19.0-71-generic. I blacklisted the cards so the host UI doesn't attach to them:

sudo gedit /etc/modules
and add:
pci_stub
vfio
vfio_iommu_type1
vfio_pci
kvm
kvm_intel

sudo update-grub

sudo vi /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_iommu=on vfio_iommu_type1.allow_unsafe_interrupts=1"

sudo update-grub

sudo gedit /etc/initramfs-tools/modules
pci_stub ids=10de:17c2,10de:1132,10de:0fb0

sudo update-initramfs -u
reboot

Running lspci -nnk -d 10de:17c2 shows:

04:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
        Kernel driver in use: pci-stub
05:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
        Kernel driver in use: pci-stub
06:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
        Kernel driver in use: pci-stub
07:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
        Kernel driver in use: pci-stub
0d:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
        Kernel driver in use: pci-stub
0e:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
        Kernel driver in use: pci-stub
0f:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
        Kernel driver in use: pci-stub
10:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
        Kernel driver in use: pci-stub

When I create a VM, the results are random as to whether a device attaches properly to the VM (sometimes it does, sometimes it doesn't), as shown below.

Example of a failed attachment

On the host:

#: dmesg
Netfilter messages via NETLINK v0.30
ip_set: protocol 6
vfio-pci 0000:10:00.0: enabling device (0100 -> 0103)
vfio_ecap_init: 0000:10:00.0 hiding ecap 0x1e@0x258
vfio_ecap_init: 0000:10:00.0 hiding ecap 0x19@0x900
kvm:zapping shadow pages for mmio generation wraparound
kvm [11446]: vcpu0 unhandled rdmsr: 0x606
kvm [11446]: vcpu0 unhandled rdmsr: 0x611
kvm [11446]: vcpu0 unhandled rdmsr: 0x639
kvm [11446]: vcpu0 unhandled rdmsr: 0x641
kvm [11446]: vcpu0 unhandled rdmsr: 0x619

lspci -vnnn -d 10de:17c2 (this output is really long, so I'm only including the 10:00.0 device):

10:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev ff) (prog-if ff)
        !!! Unknown header type 7f
        Kernel driver in use: vfio-pci

The instance's kvm command (near the end you will see -device vfio-pci,host=10:00.0, where it passes the GPU to the VM):
/usr/bin/kvm -name instance-00000182 -S -machine pc-i440fx-vivid,accel=kvm,usb=off -cpu Haswell-noTSX,+abm,+pdpe1gb,+rdrand,+f16c,+osxsave,+dca,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme -m 16384 -realtime mlock=off -smp 6,sockets=6,cores=1,threads=1 -uuid dc37c94f-d6d2-42ac-8fff-1c3a6604f317 -smbios type=1,manufacturer=OpenStack Foundation,product=OpenStack Nova,version=13.0.0,serial=8e34e073-7b4c-4e69-84fa-2d044032ad30,uuid=dc37c94f-d6d2-42ac-8fff-1c3a6604f317,family=Virtual Machine -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/instance-00000182.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/nova/instances/dc37c94f-d6d2-42ac-8fff-1c3a6604f317/disk,if=none,id=drive-virtio-disk0,format=qcow2,cache=writethrough -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=28 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:cf:9c:1d,bus=pci.0,addr=0x3 -chardev file,id=charserial0,path=/var/lib/nova/instances/dc37c94f-d6d2-42ac-8fff-1c3a6604f317/console.log -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0 -vnc 0.0.0.0:1 -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device vfio-pci,host=10:00.0,id=hostdev0,bus=pci.0,addr=0x5 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -msg timestamp=on

On the CentOS7 VM, running lspci -vnnn does not show the device. Looking at dmesg on the CentOS7 VM, I see this:
[ 0.751028] pci 0000:00:05.0: [10de:17c2] type 7f class 0xffffff
[ 0.751041] pci 0000:00:05.0: unknown header type 7f, ignoring device

At this point, without doing anything different (no reboot), I start up another VM (device 0f:00.0). Nothing is different other than the system using a different device ID, and it starts up successfully and the GPU attaches properly.

On the host:

lspci -vnnn -d 10de:17c2

0f:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
        subsystem: NVIDIA Corporation Device [10de:1132]
        Flags: fast devsel, IRQ 28
        Memory at b9000000 (32-bit, non-prefetchable) [size=16M]
        Memory at 38ff20000000 (64-bit, prefetchable) [size=256M]
        Memory at 38ff30000000 (64-bit, prefetchable) [size=32M]
        I/O ports at 3000 [size=128]
        Expansion ROM at ba000000 [disabled] [size=512k]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable-1 Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Legacy Endpoint, MSI 00
        Capabilities: [100] Express Legacy Endpoint, MSI 00
        Capabilities: [250] Latency Tolerance Reporting
        Capabilities: [258] L1 PM Substates
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900] #19
        Kernel driver in use: vfio-pci

10:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev ff) (prog-if ff)
        !!! Unknown header type 7f
        Kernel driver in use: vfio-pci

KVM command (as you can see, host=0f:00.0 is the GPU on the host):

/usr/bin/kvm -name instance-00000183 -S -machine pc-i440fx-vivid,accel=kvm,usb=off -cpu Haswell-noTSX,+abm,+pdpe1gb,+rdrand,+f16c,+osxsave,+dca,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme -m 16000 -realtime mlock=off -smp 6,sockets=6,cores=1,threads=1 -uuid 3c844181-2ae2-46d6-83ad-22363ad26e35 -smbios type=1,manufacturer=OpenStack Foundation,product=OpenStack Nova,version=13.1.1,serial=fa62c66d-7e84-45a4-addd-bf293c06c348,uuid=3c844181-2ae2-46d6-83ad-22363ad26e35,family=Virtual Machine -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/instance-00000183.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/nova/instances/3c844181-2ae2-46d6-83ad-22363ad26e35/disk,if=none,id=drive-virtio-disk0,format=qcow2,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=28 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:35:57:04,bus=pci.0,addr=0x3 -chardev file,id=charserial0,path=/var/lib/nova/instances/3c844181-2ae2-46d6-83ad-22363ad26e35/console.log -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0 -vnc 0.0.0.0:1 -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device vfio-pci,host=0f:00.0,id=hostdev0,bus=pci.0,addr=0x5 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -msg timestamp=on

On the guest VM:

lspci -vnnn -d 10de:17c2

00:05.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: NVIDIA Corporation Device [10de:1132]
        Physical Slot: 5
        Flags: fast devsel, IRQ 11
        Memory at fd000000 (32-bit, non-prefetchable) [size=16M]
        Memory at e0000000 (64-bit, prefetchable) [size=256M]
        Memory at f2000000 (64-bit, prefetchable) [size=32M]
        I/O ports at c000 [size=128]
        Expansion ROM at fe000000 [disabled] [size=512K]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Legacy Endpoint, MSI 00

dmesg:

[ 0.839593] pci 0000:00:05.0: [10de:17c2] type 00 class 0x030000

This shows that the PCI passthrough is working properly for this device.

The only way to recover the failed device from the (rev ff) (prog-if ff) state is to reboot the host.

What I mean by random is that sometimes the failure happens the first time I attach a GPU (as in this case). Other times it will be a different one. For example, I have tried attaching 2 of the devices to one VM: one GPU attaches properly, and the other does not and goes into the "Unknown header type 7f" state. I have also tried attaching 4 GPUs: 3 GPUs work and the 4th fails. Different devices fail and succeed (e.g. device 10:00.0 failed this time and 0f:00.0 succeeded, whereas if I attach 2 GPUs to a VM it is reversed: 0f:00.0 fails and 10:00.0 succeeds), so I don't feel it is hardware related.

Any suggestions on debugging "!!! Unknown header type 7f"?

Thanks,
-Kevin
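
P.S. In case it helps anyone reproduce this, here is a minimal sketch for spotting devices in the failed state without reading the full lspci output. It is not part of the setup above. It assumes that the (rev ff) (prog-if ff) and header type 7f symptoms correspond to config-space reads returning all 0xff bytes, and that the host exposes the standard sysfs file /sys/bus/pci/devices/<address>/config. The function names are made up for illustration.

```python
import os


def config_reads_all_ones(config_bytes: bytes) -> bool:
    """Return True if the first 16 bytes of PCI config space are all 0xff.

    Assumption: a device showing "rev ff", "prog-if ff" and header type 7f
    is returning 0xff for every config-space read.
    """
    header = config_bytes[:16]
    return len(header) == 16 and all(b == 0xFF for b in header)


def find_unresponsive_devices(sysfs_root: str = "/sys/bus/pci/devices") -> list:
    """List PCI addresses under sysfs_root whose config space reads all 0xff."""
    if not os.path.isdir(sysfs_root):
        return []
    bad = []
    for addr in sorted(os.listdir(sysfs_root)):
        path = os.path.join(sysfs_root, addr, "config")
        try:
            with open(path, "rb") as f:
                data = f.read(64)  # the standard header is the first 64 bytes
        except OSError:
            continue  # skip entries without a readable config file
        if config_reads_all_ones(data):
            bad.append(addr)
    return bad


if __name__ == "__main__":
    for addr in find_unresponsive_devices():
        print(addr, "config space reads all 0xff")
```

Running this on the host before and after each VM launch might show exactly when a GPU goes into the bad state, which could then be lined up against the host dmesg timestamps.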
_______________________________________________
vfio-users mailing list
vfio-users@redhat.com
https://www.redhat.com/mailman/listinfo/vfio-users