I spent some time troubleshooting, and I have things so that every use-case works except multi-seat. I’m not sure how long it would stay stable with both users in MS Office, but if one user does anything remotely 3D intensive, then the nvidia driver will crash in /both/ VMs within a minute or two, and within seconds of each other. The VMs must then be hard shut down, even though in some cases a notification states that the driver has recovered.
There is some variability to it. Sometimes they will crash at the desktop, before anything “3D intensive”, but this is rare. Usually one of them must be using its GPU to some significant extent.

One VM at a time works fine, irrespective of what I do with it. Very well, actually. In addition, I can pass two GPUs to one VM and run one benchmark on the 660 and two on the 970 (1080p and 4K/1080p, respectively), etc., and it will not crash for at least 8 hours at a time. That’s as long as I’ve run the tests for. The system is actually fairly usable throughout, despite the CPU being at redline as well.

I’ve made sure the RAM is good, and swapped the PSU for a bigger one. I give each VM 8 GB, and leave about 6 GB for the host. I have another motherboard I can repurpose as a test in the next few days, but in any case I’m pretty sure at this point I don’t have a hardware problem per se.

There could still be some conflict with my host GPU/driver. I notice that on the boot display I see the vfio module get loaded, followed by “vga arb device changed decodes”… while still in the initrd, and then everything stops on that display. This is more or less what I expect to happen, but maybe I’m wrong to. I notice that I cannot pass the boot GPU to a VM. If I try, the screen goes from being frozen with the vfio output mentioned above, to idle, and stays that way. Obviously “something is wrong” there. Even though I don’t use that GPU with VMs, and despite having blacklisted the nouveau/nvidia drivers, maybe it’s somehow related. Seems doubtful.

On the software side, I can try another distro. Changing the motherboard or the distro are not really good solutions for me, however. I’m at something of an impasse, and I could use a suggestion.
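One way to double-check the driver situation described above is to read sysfs and see which kernel driver each GPU is actually bound to on the host. The following is only a minimal sketch, not something taken from this thread: the function name `gpu_bindings` is hypothetical, and it assumes the standard Linux sysfs layout, where each PCI device directory has a `class` file and, when applicable, `driver` and `iommu_group` symlinks.

```python
import os


def _link_name(dev, name):
    """Return the basename of the symlink target dev/name, or None if absent."""
    path = os.path.join(dev, name)
    if os.path.islink(path):
        return os.path.basename(os.readlink(path))
    return None


def gpu_bindings(sysfs_root="/sys/bus/pci/devices"):
    """List (pci_address, driver, iommu_group) for display-class PCI devices.

    Assumes the usual sysfs layout; driver or iommu_group is None when the
    corresponding symlink does not exist (no driver bound, or no IOMMU).
    """
    results = []
    if not os.path.isdir(sysfs_root):
        return results
    for addr in sorted(os.listdir(sysfs_root)):
        dev = os.path.join(sysfs_root, addr)
        try:
            with open(os.path.join(dev, "class")) as f:
                pci_class = f.read().strip()
        except OSError:
            continue
        # PCI base class 0x03 covers display controllers (VGA, 3D, ...).
        if not pci_class.startswith("0x03"):
            continue
        results.append(
            (addr, _link_name(dev, "driver"), _link_name(dev, "iommu_group"))
        )
    return results


if __name__ == "__main__":
    for addr, driver, group in gpu_bindings():
        print(f"{addr}  driver={driver or 'none'}  iommu_group={group or 'n/a'}")
```

If the GPUs meant for the guests report `vfio-pci` and the boot GPU reports no nouveau/nvidia driver, the blacklist is at least taking effect. This says nothing about the crash itself; it only rules out one source of host/guest driver conflict.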
Thanks in advance,
Brian

From: Alex Williamson [mailto:alex.l.william...@gmail.com]
Sent: Thursday, July 7, 2016 3:28 PM
To: Torbjorn Jansson
Cc: Brian Yglesias; vfio-users
Subject: Re: [vfio-users] Stability issues with GTX 970 and GTX 660, Nvidia driver crashes often, seemingly under load

On Thu, Jul 7, 2016 at 1:20 PM, Torbjorn Jansson <torbjorn.jans...@mbox200.swipnet.se> wrote:

On 2016-07-07 20:01, Brian Yglesias wrote:

I've been trying to get GPU passthrough to work more reliably for a few days. I have an Asus Rampage III Formula (X58 chipset, LGA1366) with latest bios, Xeon X5670, kernel 4.4.13, qemu 2.5.1.1. I'm passing through a GTX 660 and a GTX 970, sometimes to two different VMs, and sometimes to the same one.

i have a gtx970 and it works pretty well for gpu passthru. but i'm not so sure a 660 will work and i suspect you will have reset issues.

Seems to be some growing FUD with nvidia and reset issues. AFAIK, there are no reset issues for Kepler and newer cards, including the 660. Fermi cards always seem to cause problems, but I don't necessarily think it's reset related. Reset problems on nvidia are more likely a result of trying to assign the primary host graphics or getting the card into a bad state with host graphics drivers. I have a GTX660, it doesn't get used often for this purpose but IIRC, it works just fine.
_______________________________________________
vfio-users mailing list
vfio-users@redhat.com
https://www.redhat.com/mailman/listinfo/vfio-users