I’ve been running a dual gaming VM rig (2x dedicated GPU) for a little
bit now, and everything works perfectly except that when both VMs are
under load, after an hour or so I get a hard crash and/or reboot. It will
either reboot itself, or hang so badly that the physical ‘reset’ button
on the box doesn’t work.

There is zero evidence of the crash in the Linux logs. I literally just
see one of a few standard cron jobs in the syslog, and the next line is
the kernel boot/start-up. The only real evidence I get is that, rarely,
I can hear Windows crash first. Or Windows will crash and I’ll get maybe
another second or two of ‘top’ before the whole system goes down. I find
it extremely odd that there’s some sort of (albeit fast) degradation,
but absolutely nothing interesting in the logs.
This seems very similar to the issue I have. The system freezes with nothing in the logs for this specific problem, and there's hardly any sign beforehand. Every device, including wifi, USB, SATA disks, etc., behaves as if it were powered off or severed from the machine. My only solution is to use the reset button, which causes a rather long reboot.

So, I’m pretty sure it’s something hardware related: either the PSU, or
my mobo is crap and is underpowered somewhere. During load, there are
about 5 drives, 2 GTX GPUs, and GbE (~200 Mbps) all under constant load,
so it seems likely it could be something chipset related.
The setup is similar, with 6 drives (though I mainly use two for the VM: a rotational HD and an SSD for caching; the bcache device is passed as a disk to the VM) and two GTX GPUs. I'm sure it's not a power issue, as removing one GPU from the motherboard, and thus not using passthrough for graphics, still gets me the issue. The load that causes the crash is usually heavy I/O on the VM disk.

*So my question is really: is there ANY kind of kernel/vfio software
level issue that could cause this crash? Or does this just sound like
hardware?* I’ve tried several different power configurations at this
point; I just want to be as sure as possible it’s hardware before I
start replacing more things =\
I'm starting to think it's motherboard related, as it doesn't make sense that only a few people get these issues. Perhaps correlating onboard components could pin it down to something more specific.

This is an up-to-date Ubuntu Xenial, not really running anything
special. I’ve gotten away with running my VMs almost as pure as
possible, no funny workarounds or anything. OVMF, Windows 10, Hyper-V
flags. Skylake i7 @ Z170M.
Both Xenial and Wily have this issue for me. I'm using an X99-Deluxe from ASUS with an i7-5930K. No kernel patching, default libvirt and qemu-kvm packages, default setup.

FWIW, I used to isolate 6 of the 12 logical cores of my processor and pin the vCPUs to them. I haven't seen the host choking without that; not tuning just gives slightly worse performance in the VM.
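
As a rough illustration of that kind of pinning (a sketch only, not the exact configuration used here): the snippet below expands a kernel-style CPU list and prints one libvirt `<vcpupin>` element per vCPU. The list "6-11" and the helper names `parse_cpu_list` and `vcpupin_lines` are assumptions made up for the example; which logical CPUs are actually sensible to isolate depends on the host's core/thread topology.

```python
def parse_cpu_list(spec):
    """Expand a kernel-style CPU list such as "0-2,6" into sorted ints."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def vcpupin_lines(host_cpus):
    """Pin vCPU n to the n-th host CPU, one libvirt <vcpupin> element each."""
    return [
        "<vcpupin vcpu='%d' cpuset='%d'/>" % (vcpu, cpu)
        for vcpu, cpu in enumerate(host_cpus)
    ]


if __name__ == "__main__":
    # "6-11" is a hypothetical isolated set, not a value from this thread.
    isolated = parse_cpu_list("6-11")
    print("\n".join(vcpupin_lines(isolated)))
```

The printed elements would go inside the `<cputune>` section of the domain XML; the same CPU list would typically also be what is isolated from the host scheduler.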

 Zycorax Tokoroa

_______________________________________________
vfio-users mailing list
vfio-users@redhat.com
https://www.redhat.com/mailman/listinfo/vfio-users
