On Tue, 1 May 2018 00:37:36 +0000 Andrew Zimmerman <a...@rincon.com> wrote:
> I have a system with 4 Tesla V100-SXM-16GB GPUs in it, and I am
> attempting to pass these devices through to virtual machines run by
> KVM.

SXM is the NVLink variant, right? VFIO has no special knowledge or
handling of NVLink, and I'd only consider it supported so far as it
behaves similarly to PCIe. A particular concern with NVLink is how the
mesh nature of the interconnect makes use of, and enforces, the
IOMMU-based translations necessary for device isolation and
assignment, but we can't know this because it's proprietary. Does
NVIDIA claim that VFIO device assignment is supported for these GPUs?

> I am managing the VMs with OpenNebula and I have followed the
> instructions at
> https://docs.opennebula.org/5.4/deployment/open_cloud_host_setup/pci_passthrough.html
> to pass the device through to my VM. I am able to see the device in
> nvidia-smi, watch its power/temperature levels, change the
> persistence mode and compute mode, etc.

Ugh, official documentation that recommends the vfio-bind script and
manually modifying libvirt's ACL list. I'd be suspicious of any device
assignment support making use of those sorts of instructions.

> I can query the device to get properties and capabilities, but when I
> try to run a program on it that utilizes the device (beyond
> querying), I receive an error message about the device being
> unavailable. To test, I am using simpleAtomicIntrinsics out of the
> CUDA Samples. Here is the output I receive:
>
> simpleAtomicIntrinsics starting...
>
> GPU Device 0: "Tesla V100-SXM2-16GB": with compute capability 7.0
>
> GPU device has 80 Multi-Processors, SM 7.0 compute capabilities
>
> Cuda error at simpleAtomicIntrinsics.cu:108
> code=46(cudaErrorDevicesUnavailable) "cudaMalloc((void **) &dOData,
> memsize)"
>
> I have tried this with multiple devices (in case there was an issue
> with vfio on the first device) and had the same result on each of
> them.

Have you tried with a PCIe Tesla?

> The host OS is CentOS 7.4.1708. I upgraded the kernel to 4.15.15-1
> from elrepo to ensure that I had support for vfio_virqfd. I am
> running the NVIDIA 390.15 driver and using CUDA 9.1
> (cuda-9-1-9.1.85-1.x86_64 rpm).

vfio_virqfd is just an artifact of OpenNebula's apparently terrible
handling of device assignment. virqfd is there in a RHEL/CentOS 7.4
kernel, but it may not be a separate module, and it's not necessary to
load it via dracut as indicated in their guide; you only need to
blacklist nouveau.

> Does anyone have ideas on what could be causing this or what I could
> try next?

I think you're in uncharted territory, with NVLink-based GPUs and
not-quite-standard device assignment support in your chosen distro.
I'd start by testing whether the test program works with PCIe GPUs, to
eliminate the interconnect as the issue.

Thanks,
Alex

_______________________________________________
vfio-users mailing list
vfio-users@redhat.com
https://www.redhat.com/mailman/listinfo/vfio-users
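P.S. For the nouveau blacklisting mentioned above, a couple of
modprobe fragments on the host are all that's needed. A minimal
sketch; the file names are just conventional, and the 10de:1db1
vendor:device ID is an example (check yours with `lspci -nn`):

```
# /etc/modprobe.d/blacklist-nouveau.conf
# Keep the host from grabbing the GPUs with nouveau.
blacklist nouveau
options nouveau modeset=0

# /etc/modprobe.d/vfio.conf
# Optionally have vfio-pci claim the GPUs at boot; 10de:1db1 is an
# example vendor:device ID, substitute the one lspci -nn reports.
options vfio-pci ids=10de:1db1
```

If nouveau is baked into the initramfs, regenerate it afterwards
(`dracut -f` on CentOS) so the blacklist takes effect early.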
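As for the vfio-bind script and ACL edits, libvirt can manage the
binding itself through a managed hostdev. A sketch of the domain XML
element, where the PCI address is a placeholder for your GPU's actual
address:

```xml
<!-- In the guest's domain XML: with managed='yes', libvirt unbinds
     the device from its host driver and binds it to vfio-pci
     automatically, restoring it on guest shutdown, so no vfio-bind
     script or qemu ACL edits are needed.  The address below is an
     example; use your GPU's address from lspci. -->
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x89' slot='0x00' function='0x0'/>
  </source>
</hostdev>
```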
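On the isolation question, it's at least worth confirming what the
IOMMU groups look like on the host, since every device sharing a group
with the GPU has to be bound to vfio-pci for assignment to work. A
small sketch of a helper using only standard sysfs paths; SYSFS is
overridable so it can be exercised without the hardware:

```shell
#!/bin/sh
# Print the IOMMU group of a PCI device, and list a group's members.
# SYSFS defaults to the real sysfs but can be pointed at a test tree.
SYSFS="${SYSFS:-/sys}"

iommu_group_of() {
    # e.g. iommu_group_of 0000:89:00.0  ->  group number, or "none"
    link="$SYSFS/bus/pci/devices/$1/iommu_group"
    if [ -e "$link" ]; then
        basename "$(readlink -f "$link")"
    else
        echo none
    fi
}

group_members() {
    # List all devices the kernel placed in IOMMU group $1.
    ls "$SYSFS/kernel/iommu_groups/$1/devices" 2>/dev/null
}
```

For example, `group_members "$(iommu_group_of 0000:89:00.0)"` shows
everything that would be pulled into the guest's isolation domain
along with that GPU.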