Alex,

Thank you for your reply and all of your ideas. You are right that the SXM uses NVLink - I had not thought of that as a potential culprit. I do not have any PCIe GPUs in this cluster, but I may be able to set up a standalone test on an older box.
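For that standalone test, I will probably start with something minimal along these lines rather than the full CUDA samples - just a rough sketch (the file name and build line below are placeholders) that exercises the same query path that already works for me and the cudaMalloc/kernel-launch path that fails with cudaErrorDevicesUnavailable:

// standalone_vfio_check.cu -- rough sketch, not the actual CUDA sample code.
// Build (assuming a standard CUDA 9.x toolkit install):
//   nvcc -o standalone_vfio_check standalone_vfio_check.cu
//
// For each visible GPU this tries the "query" path that works for me through
// VFIO (cudaGetDeviceProperties) and then the "use" path that fails in the
// guest (cudaMalloc plus a trivial kernel), printing the error per device.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(int *p)
{
    // Trivial kernel: write one value so the device actually executes code.
    p[0] = 42;
}

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %d (%s)\n", err, cudaGetErrorString(err));
        return 1;
    }
    printf("Found %d CUDA device(s)\n", count);

    for (int dev = 0; dev < count; ++dev) {
        cudaSetDevice(dev);

        // "Query" path -- this part already works for me in the VM.
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s -- query: %s\n", dev,
               err == cudaSuccess ? prop.name : "?", cudaGetErrorString(err));

        // "Use" path -- this is where simpleAtomicIntrinsics dies with
        // cudaErrorDevicesUnavailable in the VM.
        int *d_buf = NULL;
        err = cudaMalloc((void **)&d_buf, sizeof(int));
        printf("Device %d: cudaMalloc: %d (%s)\n", dev, err, cudaGetErrorString(err));

        if (err == cudaSuccess) {
            touch<<<1, 1>>>(d_buf);
            err = cudaDeviceSynchronize();
            printf("Device %d: kernel launch: %d (%s)\n", dev, err, cudaGetErrorString(err));
            cudaFree(d_buf);
        }
        cudaDeviceReset();
    }
    return 0;
}

If that allocation works on a PCIe card in the older box but still fails on the SXM nodes, that would point at the interconnect rather than my VFIO/ACL setup.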
I have not seen a specific mention from NVIDIA regarding VFIO support for this form factor of the Tesla V100, but there were talks at GTC about using Tesla cards with VFIO. Do you know of a better guide you could point me to for getting up and running with VFIO? It felt like a permissions issue to me (I can query the device, but not write to it), so it could be a problem with how the guide had me set up the ACLs.

It is great to know that the CentOS 7.4 kernel does support virqfd - from what I had read, it seemed like it should be there, but I decided to try upgrading the kernel anyway just in case that was the issue.

Thank you,
Andy

From: Alex Williamson [mailto:alex.william...@redhat.com]
Sent: Tuesday, May 1, 2018 10:38 AM
To: Andrew Zimmerman <a...@rincon.com>
Cc: 'vfio-users@redhat.com' <vfio-users@redhat.com>
Subject: Re: [vfio-users] cudaErrorDevicesUnavailalbe using Cuda in KVM using VFIO device passthrough

On Tue, 1 May 2018 00:37:36 +0000
Andrew Zimmerman <a...@rincon.com> wrote:

> I have a system with 4 Tesla V100-SXM-16GB GPUs in it, and I am

SXM is the NVLink variant, right? VFIO has no special knowledge or handling of NVLink and I'd only consider it supported so far as it behaves similarly to PCIe. A particular concern of NVLink is how the mesh nature of the interconnect makes use of and enforces IOMMU-based translations necessary for device isolation and assignment, but we can't know this because it's proprietary. Does NVIDIA claim that VFIO device assignment is supported for these GPUs?

> attempting to pass these devices through to virtual machines run by
> KVM. I am managing the VMs with OpenNebula and I have followed the
> instructions at
> https://docs.opennebula.org/5.4/deployment/open_cloud_host_setup/pci_passthrough.html
> to pass the device through to my VM. I am able to see the device in
> nvidia-smi, watch its power/temperature levels, change the
> persistence mode and compute mode, etc.

Ugh, official documentation that recommends the vfio-bind script and manually modifying libvirt's ACL list. I'd be suspicious of any device assignment support making use of those sorts of instructions.

> I can query the device to get properties and capabilities, but when I
> try to run a program on it that utilizes the device (beyond
> querying), I receive an error message about the device being
> unavailable. To test, I am using simpleAtomicIntrinsics out of the
> CUDA Samples. Here is the output I receive:
>
> simpleAtomicIntrinsics starting...
>
> GPU Device 0: "Tesla V100-SXM2-16GB": with compute capability 7.0
>
> GPU device has 80 Multi-Processors, SM 7.0 compute capabilities
>
> CUDA error at simpleAtomicIntrinsics.cu:108
> code=46(cudaErrorDevicesUnavailable) "cudaMalloc((void **) &dOData,
> memsize)"
>
> I have tried this with multiple devices (in case there was an issue
> with vfio on the first device) and had the same result on each of
> them.

Have you tried with a PCIe Tesla?

> The host OS is CentOS 7.4.1708. I upgraded the kernel to 4.15.15-1
> from elrepo to ensure that I had support for vfio_virqfd. I am
> running the NVIDIA 390.15 driver and using CUDA 9.1
> (cuda-9-1-9.1.85-1.x86_64 rpm).

vfio_virqfd is just an artifact of OpenNebula's apparent terrible handling of device assignment.
virqfd is there in a RHEL/CentOS 7.4 kernel, but it may not be a separate module, and it's not necessary to load it via dracut as indicated in their guide, only to blacklist nouveau.

> Does anyone have ideas on what could be causing this or what I could
> try next?

I think you're in uncharted territory with NVLink-based GPUs and not quite standard device assignment support in your chosen distro. I'd start with testing whether the test program works with PCIe GPUs to eliminate the interconnect issue.

Thanks,
Alex
_______________________________________________
vfio-users mailing list
vfio-users@redhat.com
https://www.redhat.com/mailman/listinfo/vfio-users