Hi folks,

Related to this, I wonder if anyone has ever seen something like a PCI bus error on a GPU node? We have a fleet of Dell R730s with dual K80s, and we are periodically seeing the host reset, with the hardware log recording a message like:

"A fatal error was detected on a component at bus 4 device 8 function 0."
Which in this case refers to:

$ lspci -t -d 10b5:8747
-+-[0000:82]---00.0-[83-85]--+-08.0-[84]--
 |                           \-10.0-[85]--
 +-[0000:03]---00.0-[04-06]--+-08.0-[05]--
 |                           \-10.0-[06]--

That is one of the downstream(?) endpoint-facing PCIe ports, i.e., the GPU side of the PCIe switch.

This error causes the host to unceremoniously reset. There is no error to be found anywhere on the host side, just in the hardware log. These are currently Ubuntu Trusty hosts with a 4.4 kernel. GPU burn testing does not seem to trigger it, and the host can go back into production and never (so far) see the issue again. But we've now seen this about 10 times over the last 12-18 months across a fleet of ~30 of these hosts (sometimes twice on the same host months apart, but on several distinct hosts overall).

Cheers,

On 7 May 2017 at 07:55, Blair Bethwaite <blair.bethwa...@gmail.com> wrote:
> Hi all,
>
> I've been (very slowly) working on some docs detailing how to set up an
> OpenStack Nova Libvirt+QEMU-KVM deployment to provide GPU-accelerated
> instances. In Boston I hope to chat to some of the docs team and
> figure out an appropriate upstream guide to fit that into. One of the
> things I'd like to provide is a community record (better than ML
> archives) of what works and what doesn't. I've started a first attempt at
> collating some basics here:
> https://etherpad.openstack.org/p/GPU-passthrough-model-success-failure
>
> I know there are at least a few lurkers out there doing this too, so
> please share your own experience. Once there is a bit more data there
> it probably makes sense to convert to a tabular format of some kind
> (but it wasn't immediately obvious to me how that should look, given there
> are several long list fields).
>
> --
> Cheers,
> ~Blairo

--
Cheers,
~Blairo

_______________________________________________
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators