Lev:
Can you run with --mca mpi_common_cuda_verbose 100 --mca mpool_rgpusm_verbose 100
and send me (rvandeva...@nvidia.com) the output?
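For example, something along these lines (adjust the process count and substitute
your actual program and its arguments; the script name below is only a placeholder):

  # placeholder invocation: swap in your real script, arguments, and process count
  mpiexec -n 2 \
      --mca mpi_common_cuda_verbose 100 \
      --mca mpool_rgpusm_verbose 100 \
      python your_script.py 2>&1 | tee cuda_verbose.log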
Thanks,
Rolf

>-----Original Message-----
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
>Sent: Wednesday, September 02, 2015 7:15 PM
>To: us...@open-mpi.org
>Subject: [OMPI users] tracking down what's causing a cuIpcOpenMemHandle
>error emitted by OpenMPI
>
>I recently noticed the following error when running a Python program I'm
>developing that repeatedly performs GPU-to-GPU data transfers via
>OpenMPI:
>
>The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
>cannot be used.
>  cuIpcGetMemHandle return value:   1
>  address: 0x602e75000
>Check the cuda.h file for what the return value means. Perhaps a reboot of
>the node will clear the problem.
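>
>(For what it's worth, return value 1 seems to correspond to
>CUDA_ERROR_INVALID_VALUE in the CUresult enum in cuda.h; a quick check along
>these lines should confirm it, though the header path below is just where the
>CUDA 7.0 deb packages put it on my system, so adjust as needed:
>
>  # header path assumed from the deb-packaged CUDA 7.0 install
>  grep -n "CUDA_ERROR_INVALID_VALUE" /usr/local/cuda-7.0/include/cuda.h
>)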
>
>The system is running Ubuntu 14.04.3 and contains several Tesla S2050 GPUs.
>I'm using the following software:
>
>- Linux kernel 3.19.0 (backported to Ubuntu 14.04.3 from 15.04)
>- CUDA 7.0 (installed via NVIDIA's deb packages)
>- NVIDIA kernel driver 346.82
>- OpenMPI 1.10.0 (manually compiled with CUDA support)
>- Python 2.7.10
>- pycuda 2015.1.3 (manually compiled against CUDA 7.0)
>- mpi4py (manually compiled git revision 1d8ab22)
>
>OpenMPI, Python, pycuda, and mpi4py are all locally installed in a conda
>environment.
>
>Judging from my program's logs, the error pops up during one of the
>program's first few iterations. The error isn't fatal, however - the program
>continues running to completion after the message appears.  Running
>mpiexec with --mca plm_base_verbose 10 doesn't seem to produce any
>additional debug info of use in tracking this down.  I did notice, though, that
>there are undeleted cuda.shm.* files in /run/shm after the error message
>appears and my program exits. Deleting the files does not prevent the error
>from recurring if I subsequently rerun the program.
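>
>(Concretely, the check and cleanup I tried look roughly like this; the exact
>segment names vary from run to run, so the glob below is just illustrative:
>
>  ls -l /run/shm/cuda.shm.*   # leftover segments after the error and program exit
>  rm -f /run/shm/cuda.shm.*   # removing them does not stop the error on a rerun
>)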
>
>Oddly, the above problem doesn't crop up when I run the same code on an
>Ubuntu 14.04.3 system with the exact same software but two non-Tesla GPUs
>(specifically, a GTX 470 and a GTX 750). The error seems to have started
>occurring over the past two weeks, but none of the changes I made to my code
>over that time appear to be related to the problem (i.e., running an older
>revision resulted in the same errors). I also tried running my code with older
>releases of OpenMPI (e.g., 1.8.5) and mpi4py (e.g., from about 4 weeks ago),
>but the error message still occurred. Both Ubuntu systems are 64-bit and have
>been kept up to date with the latest package updates.
>
>Any thoughts as to what could be causing the problem?
>--
>Lev Givon
>Bionet Group | Neurokernel Project
>http://www.columbia.edu/~lev/
>http://lebedov.github.io/
>http://neurokernel.github.io/