On Tue, 18 Oct 2016 17:48:59 -0500 Kevin Vasko <kva...@gmail.com> wrote:
> Alex,
>
> I think I was able to do it successfully and was successfully able to make
> the thing fail. It went from (rev a1) to (rev ff) with response of the
> header error.
>
> Instead of doing all devices I just did 1 at a time.
>
> This was the output of
>
> # lspci -tv
>
>  +-02.0-[02-08]----00.0-[03-08]--+-00.0-[04]--+--00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X]
>                                  |            \-00.1 NVIDIA Corporation Device efb0
>                                  +-04.0-[05]--+--00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X]
>                                  |            \-00.1 NVIDIA Corporation Device efb0
>                                  +-08.0-[06]--+--00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X]
>                                  |            \-00.1 NVIDIA Corporation Device efb0
>                                  +-0c.0-[07]--+--00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X]
>                                  |            \-00.1 NVIDIA Corporation Device efb0
>                                  +-14.0-[08]----00.0 Mellanox Technologies MT27600 Family [ConnectX-3]
>  +-03.0-[09-12]----00.0-[0a-12]--+-08.0-[0b-11]----00.0-[0c-11]--+--00.0-[0d]--+-00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X]
>                                                                  |             \-00.1 NVIDIA Corporation Device 0fb0
>                                                                  +--04.0-[0e]--+-00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X]
>                                                                  |             \-00.1 NVIDIA Corporation Device 0fb0
>                                                                  +--08.0-[0f]--+-00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X]
>                                                                  |             \-00.1 NVIDIA Corporation Device 0fb0
>                                                                  +--0c.0-[10]--+-00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X]
>                                                                  |             \-00.1 NVIDIA Corporation Device 0fb0
>
> I tried the first device
> # virsh nodedev-detach --driver=kvm pci_0000_04_00_0
> Device pci_0000_04_00_0 detached
>
> # virsh nodedev-detach --driver=kvm pci_0000_04_00_1
> Device pci_0000_04_00_1 detached
>
> In the script I put
>
> DEVS=(
> 03:00.0
> 04
> )
>
> Ran it 100 times and got no error.
>
> Ran it for a different device 05
>
> # virsh nodedev-detach --driver=kvm pci_0000_05_00_0
> Device pci_0000_05_00_0 detached
>
> # virsh nodedev-detach --driver=kvm pci_0000_05_00_1
> Device pci_0000_05_00_1 detached
>
> DEVS=(
> 03:04.0
> 05:
> )
>
> I saw this.
>
> #: for i in $(seq 1 100); do ./reset.sh; done
> 05:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
> 05:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev a1)
> 05:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
> 05:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev a1)
> 05:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev ff)
> 05:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev ff)
>
> I repeated this with another device on the system.
>
> I assume this indicates that the device is not resetting properly? The
> question is where do I go from here? Would this indicate a problem with the
> PCI Reset code or problematic hardware?

Right, the PCIe link is not coming back for some reason, which looks like
a hardware issue. Can you attach the output of 'sudo lspci -vvvs 3:04.0'
when you're in this state (replace with the appropriate parent bridge
for the failed device)? Maybe we can see whether that downstream port is
stuck in training.

What I would do next is test each card repeatedly. Do only some cards
fail? If so, swap a working card and a non-working card: does the
failure follow the card or the slot? I'm not sure what the result is
going to be, but if we can't rely on a PCI bus reset then you're really
not going to have any repeatability with assigning the GPUs. Thanks,

Alex

_______________________________________________
vfio-users mailing list
vfio-users@redhat.com
https://www.redhat.com/mailman/listinfo/vfio-users
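
The per-card test suggested above could be sketched roughly as follows. This is only an illustration, not the reset.sh used in the thread (its contents are not shown here): the function names find_dead_functions and reset_until_failure are hypothetical, and the reset command is whatever you pass in. It stops at the first iteration where a function reads back as (rev ff), so the parent bridge can be inspected with lspci -vvv while the link is still down.

```shell
#!/bin/sh
# Hypothetical helper: read `lspci` output lines on stdin and print the
# address of every function that reports "(rev ff)", i.e. one whose config
# space reads back as all ones.
find_dead_functions() {
    awk '/\(rev ff\)/ { print $1 }'
}

# Hypothetical helper: run a reset command repeatedly and stop at the first
# failure, leaving the device in the failed state for inspection.
#   $1 = device address as accepted by `lspci -s` (e.g. 05:00)
#   $2 = number of iterations
#   remaining arguments = the reset command to run (an assumption; e.g. ./reset.sh)
reset_until_failure() {
    dev=$1
    count=$2
    shift 2
    i=1
    while [ "$i" -le "$count" ]; do
        "$@"
        dead=$(lspci -s "$dev" | find_dead_functions)
        if [ -n "$dead" ]; then
            echo "iteration $i: no response from: $dead"
            return 1
        fi
        i=$((i + 1))
    done
    return 0
}

# Usage sketch (run as root, once per card, noting which slot each card is in):
#   reset_until_failure 05:00 100 ./reset.sh || sudo lspci -vvvs 3:04.0
```

Recording the iteration count per card and per slot, then repeating after swapping a failing card with a working one, gives the card-versus-slot comparison directly.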