I think I've found the cause of this problem. Long story short, it's caused by vfio's use of the D3hot PCI power state.
By comparing traces of failed and successful boots I noticed that the PCI config space looked different, so I started saving dumps of the config space for later comparison. Every time the host rebooted I got a new config space state, and some of these states didn't work well and caused the VM hangs. VFIO resets the device both before and after use, but that sometimes doesn't help, so I guess the rumours about reset bugs on NVIDIA hardware are true.

This is what the config space looks like just after an unsuccessful boot (and vfio reset attempt): http://sprunge.us/KMPS  The BARs have an address of 0 in the hexdump and lspci displays them as "virtual". That seems to be the main difference between a successful boot and a failure.

Trying to understand why the PCI config space was different on each boot, I remembered that I had to use vfio-pci.disable_idle_d3=1 on my old Radeon HD 6770 because its state was corrupted after entering D3. This new NVIDIA card isn't corrupted as badly and sometimes works even with D3. After enabling vfio-pci.disable_idle_d3=1 the PCI config space is the same on every boot and the VM always starts like it should.

I find it odd that no one else has noticed this problem before. Perhaps my motherboard has problems with the D3 state?
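In case anyone wants to check for the same thing on their own hardware, here is roughly the comparison I was doing, written up as a small Python sketch. The device address (0000:01:00.0) and the baseline filename are just example values for this post, not my actual setup; substitute your own GPU's address from lspci, and run it as root so the whole config space is readable.

#!/usr/bin/env python3
# Rough sketch of the config space comparison described above.
# DEVICE and BASELINE are placeholders for this example -- substitute your
# GPU's address from lspci and whatever filename you like.

from pathlib import Path

DEVICE = "0000:01:00.0"            # example address, not necessarily yours
CONFIG = Path(f"/sys/bus/pci/devices/{DEVICE}/config")
BASELINE = Path("config.good")     # dump saved after a known-good boot


def main() -> None:
    current = CONFIG.read_bytes()  # run as root to read the full config space
    if not BASELINE.exists():
        # First run: remember this boot's state as the reference.
        BASELINE.write_bytes(current)
        print(f"saved {len(current)} bytes to {BASELINE}")
        return

    reference = BASELINE.read_bytes()
    diffs = [(off, a, b)
             for off, (a, b) in enumerate(zip(reference, current))
             if a != b]
    if not diffs:
        print("config space matches the saved baseline")
        return

    # The BARs start at offset 0x10 in the header, so a run of zeros there
    # is the "virtual BAR" situation from the dump linked above.
    for off, a, b in diffs:
        print(f"offset 0x{off:03x}: baseline 0x{a:02x} -> current 0x{b:02x}")


if __name__ == "__main__":
    main()

For reference, vfio-pci.disable_idle_d3=1 is the kernel command line form of the option; if you load vfio-pci as a module, an "options vfio-pci disable_idle_d3=1" line in a modprobe.d file should be equivalent.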