I recently noticed the following error when running a Python program I'm developing that repeatedly performs GPU-to-GPU data transfers via OpenMPI (through mpi4py and pycuda).
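To give a sense of what the program does, the transfers are essentially of the following form. This is only a minimal sketch for illustration, not my actual code; the bufint() helper name is mine, and the snippet assumes an OpenMPI build with CUDA support so that device pointers can be passed directly to MPI calls:

    # Minimal sketch: rank 0 sends the contents of a pycuda GPUArray directly
    # to a GPUArray on rank 1 through CUDA-aware OpenMPI.
    import ctypes

    import numpy as np
    import pycuda.autoinit   # per-rank device selection omitted for brevity
    import pycuda.gpuarray as gpuarray
    from mpi4py import MPI

    def bufint(a):
        # Wrap the GPUArray's device pointer in a ctypes array so that mpi4py
        # hands the raw pointer to OpenMPI, which recognizes it as GPU memory
        # when built with CUDA support.
        return (ctypes.c_byte * a.nbytes).from_address(a.ptr)

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    N = 1024
    if rank == 0:
        x_gpu = gpuarray.to_gpu(np.random.rand(N))
        comm.Send([bufint(x_gpu), MPI.DOUBLE], dest=1)
    elif rank == 1:
        y_gpu = gpuarray.empty(N, np.float64)
        comm.Recv([bufint(y_gpu), MPI.DOUBLE], source=0)

I launch the program with mpiexec (e.g., mpiexec -n 2 python myprog.py).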
The error message reads:

    The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
    cannot be used.
      cuIpcGetMemHandle return value: 1
      address: 0x602e75000
    Check the cuda.h file for what the return value means. Perhaps a reboot
    of the node will clear the problem.

The system is running Ubuntu 14.04.3 and contains several Tesla S2050 GPUs. I'm using the following software:

- Linux kernel 3.19.0 (backported to Ubuntu 14.04.3 from 15.04)
- CUDA 7.0 (installed via NVIDIA's deb packages)
- NVIDIA kernel driver 346.82
- OpenMPI 1.10.0 (manually compiled with CUDA support)
- Python 2.7.10
- pycuda 2015.1.3 (manually compiled against CUDA 7.0)
- mpi4py (manually compiled git revision 1d8ab22)

OpenMPI, Python, pycuda, and mpi4py are all locally installed in a conda environment.

Judging from my program's logs, the error pops up during one of the program's first few iterations. The error isn't fatal, however; the program continues running to completion after the message appears.

Running mpiexec with --mca plm_base_verbose 10 doesn't seem to produce any additional debug info of use in tracking this down. I did notice, though, that there are undeleted cuda.shm.* files in /run/shm after the error message appears and my program exits. Deleting those files does not prevent the error from recurring when I subsequently rerun the program.

Oddly, the above problem doesn't crop up when I run the same code with the exact same software on an Ubuntu 14.04.3 system that contains two non-Tesla GPUs (specifically, a GTX 470 and a GTX 750).

The error seems to have started occurring over the past two weeks, but none of the changes I made to my code during that time appear to be related to the problem (i.e., running an older revision resulted in the same errors). I also tried running my code with older releases of OpenMPI (e.g., 1.8.5) and mpi4py (e.g., a git revision from about 4 weeks ago), but the error message still occurs.

Both Ubuntu systems are 64-bit and have been kept up to date with the latest package updates.

Any thoughts as to what could be causing the problem?
--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/