I have a system with four Tesla V100-SXM2-16GB GPUs, and I am attempting to 
pass these devices through to virtual machines running under KVM. I am managing 
the VMs with OpenNebula and followed the instructions at 
https://docs.opennebula.org/5.4/deployment/open_cloud_host_setup/pci_passthrough.html
to pass a device through to my VM. I can see the device in nvidia-smi, watch 
its power and temperature readings, change the persistence mode and compute 
mode, and so on.

I can query the device for its properties and capabilities, but when I run a 
program that actually uses the device (beyond querying it), I get an error 
saying the devices are unavailable.
To test, I am using simpleAtomicIntrinsics from the CUDA Samples. Here is the 
output I receive:

SimpleAtomicIntrinsics starting...

GPU Device 0: "Tesla V100-SXM2-16GB": with compute capability 7.0

> GPU device has 80 Multi-Processors, SM 7.0 compute capabilities

Cuda error at simpleAtomicIntrinsics.cu:108 
code=46(cudaErrorDevicesUnavailable) "cudaMalloc((void **) &dOData, memsize)"

I have tried this with multiple devices (in case there was an issue with vfio 
on the first device) and had the same result on each of them.

The host OS is CentOS 7.4.1708. I upgraded the kernel to 4.15.15-1 from ELRepo 
to make sure I had support for vfio_virqfd.
I am running the NVIDIA 390.15 driver with CUDA 9.1 (the 
cuda-9-1-9.1.85-1.x86_64 RPM).

Does anyone have ideas on what could be causing this or what I could try next?

Thank you for your help and ideas,
Andy