I recently noticed the following error when running a Python program I'm
developing that repeatedly performs GPU-to-GPU data transfers via OpenMPI:

The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
cannot be used.
  cuIpcGetMemHandle return value:   1
  address: 0x602e75000
Check the cuda.h file for what the return value means. Perhaps a reboot
of the node will clear the problem.
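
For reference, cuda.h defines a CUresult value of 1 as CUDA_ERROR_INVALID_VALUE. Below is a minimal sketch of a lookup for the first few codes. The table is deliberately partial, and the helper name curesult_name is my own for illustration; it is not part of CUDA, pycuda, or OpenMPI.

```python
# Partial table of CUresult values as defined in cuda.h (CUDA driver API).
# Only the first few codes are listed; anything else is reported as unknown.
CURESULT_NAMES = {
    0: "CUDA_SUCCESS",
    1: "CUDA_ERROR_INVALID_VALUE",
    2: "CUDA_ERROR_OUT_OF_MEMORY",
    3: "CUDA_ERROR_NOT_INITIALIZED",
    4: "CUDA_ERROR_DEINITIALIZED",
}


def curesult_name(code):
    """Return the symbolic name for a CUresult code (hypothetical helper)."""
    return CURESULT_NAMES.get(code, "unknown CUresult ({})".format(code))


# The return value reported in the error message above.
print(curesult_name(1))
```

So the failing call is being rejected as having an invalid argument, which by itself doesn't say which argument or why.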

The system is running Ubuntu 14.04.3 and contains several Tesla S2050 GPUs. I'm
using the following software:

- Linux kernel 3.19.0 (backported to Ubuntu 14.04.3 from 15.04)
- CUDA 7.0 (installed via NVIDIA's deb packages)
- NVIDIA kernel driver 346.82
- OpenMPI 1.10.0 (manually compiled with CUDA support) 
- Python 2.7.10
- pycuda 2015.1.3 (manually compiled against CUDA 7.0)
- mpi4py (manually compiled git revision 1d8ab22)

OpenMPI, Python, pycuda, and mpi4py are all locally installed in a conda
environment.

Judging from my program's logs, the error pops up during one of the program's
first few iterations. The error isn't fatal, however - the program continues
running to completion after the message appears.  Running mpiexec with --mca
plm_base_verbose 10 doesn't seem to produce any additional debug info of use in
tracking this down.  I did notice, though, that there are undeleted cuda.shm.*
files in /run/shm after the error message appears and my program
exits. Deleting the files does not prevent the error from recurring if I
subsequently rerun the program.
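
To check for the leftover files after each run, something like the following sketch can be used. It assumes the files match the pattern cuda.shm.* directly under /run/shm, as observed on my system; the function name find_cuda_shm_files is made up for illustration.

```python
import glob
import os


def find_cuda_shm_files(shm_dir="/run/shm"):
    """Return the sorted paths of cuda.shm.* files found in shm_dir.

    The default directory and the file name pattern are assumptions based
    on what was observed on this particular system.
    """
    return sorted(glob.glob(os.path.join(shm_dir, "cuda.shm.*")))


if __name__ == "__main__":
    # List each leftover file along with its size in bytes.
    for path in find_cuda_shm_files():
        print("{}  {} bytes".format(path, os.path.getsize(path)))
```

Running this before and after the program shows whether a given run left new files behind.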

Oddly, the problem doesn't crop up when I run the same code on a second Ubuntu
14.04.3 system that has the exact same software installed but contains 2
non-Tesla GPUs (specifically, a GTX 470 and a GTX 750). The error seems to have
started occurring
over the past two weeks, but none of the changes I made to my code over that
time seem to be related to the problem (i.e., running an older revision resulted
in the same errors). I also tried running my code using older releases of
OpenMPI (e.g., 1.8.5) and mpi4py (e.g., from about 4 weeks ago), but the error
message still occurs. Both Ubuntu systems are 64-bit and have been kept up to
date with the latest package updates.

Any thoughts as to what could be causing the problem? 
-- 
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
