I am not sure why you are seeing this, but one thing that is clear is that you 
have found a bug in the error reporting: the message is a little garbled, and 
the values it prints are wrong. I will fix that.

If possible, could you try running with --mca btl_smcuda_use_cuda_ipc 0?  My 
expectation is that you will not see any errors, but you may lose some performance.
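
For example, assuming you launch with mpirun (the process count and script 
name below are just placeholders for whatever you actually run):

  mpirun -np 16 --mca btl_smcuda_use_cuda_ipc 0 python your_script.py

This disables the CUDA IPC path in the smcuda BTL, so intra-node GPU-to-GPU 
messages should be staged through host memory rather than going through 
cuIpcOpenMemHandle, which is why you may see lower performance.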

What does your hardware configuration look like?  Can you send me the output 
of "nvidia-smi topo -m"?

Thanks,
Rolf

>-----Original Message-----
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
>Sent: Tuesday, May 19, 2015 6:30 PM
>To: us...@open-mpi.org
>Subject: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI
>1.8.5 with CUDA 7.0 and Multi-Process Service
>
>I'm encountering intermittent errors while trying to use the Multi-Process
>Service with CUDA 7.0 for improving concurrent access to a Kepler K20Xm GPU
>by multiple MPI processes that perform GPU-to-GPU communication with
>each other (i.e., GPU pointers are passed to the MPI transmission primitives).
>I'm using GitHub revision 41676a1 of mpi4py built against OpenMPI 1.8.5,
>which is in turn built against CUDA 7.0. In my current configuration, I have 4
>MPS server daemons running, each of which controls access to one of 4 GPUs;
>the MPI processes spawned by my program are partitioned into 4 groups
>(which might contain different numbers of processes) that each talk to a
>separate daemon. For certain transmission patterns between these
>processes, the program runs without any problems. For others (e.g., 16
>processes partitioned into 4 groups), however, it dies with the following 
>error:
>
>[node05:20562] Failed to register remote memory, rc=-1
>--------------------------------------------------------------------------
>The call to cuIpcOpenMemHandle failed. This is an unrecoverable error and
>will cause the program to abort.
>  cuIpcOpenMemHandle return value:   21199360
>  address: 0x1
>Check the cuda.h file for what the return value means. Perhaps a reboot of
>the node will clear the problem.
>--------------------------------------------------------------------------
>[node05:20562] [[58522,2],4] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
>-------------------------------------------------------
>Child job 2 terminated normally, but 1 process returned a non-zero exit code..
>Per user-direction, the job has been aborted.
>-------------------------------------------------------
>[node05][[58522,2],5][btl_tcp_frag.c:142:mca_btl_tcp_frag_send] mca_btl_tcp_frag_send: writev failed: Connection reset by peer (104)
>[node05:20564] Failed to register remote memory, rc=-1
>[node05:20564] [[58522,2],6] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
>[node05:20566] Failed to register remote memory, rc=-1
>[node05:20566] [[58522,2],8] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
>[node05:20567] Failed to register remote memory, rc=-1
>[node05:20567] [[58522,2],9] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
>[node05][[58522,2],11][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>[node05:20569] Failed to register remote memory, rc=-1
>[node05:20569] [[58522,2],11] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
>[node05:20571] Failed to register remote memory, rc=-1
>[node05:20571] [[58522,2],13] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
>[node05:20572] Failed to register remote memory, rc=-1
>[node05:20572] [[58522,2],14] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
>
>After the above error occurs, I notice that /dev/shm/ is littered with
>cuda.shm.* files. I tried cleaning up /dev/shm before running my program,
>but that doesn't seem to have any effect upon the problem. Rebooting the
>machine also doesn't have any effect. I should also add that my program runs
>without any error if the groups of MPI processes talk directly to the GPUs
>instead of via MPS.
>
>Does anyone have any ideas as to what could be going on?
>--
>Lev Givon
>Bionet Group | Neurokernel Project
>http://www.columbia.edu/~lev/
>http://lebedov.github.io/
>http://neurokernel.github.io/
>
>_______________________________________________
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-mpi.org/community/lists/users/2015/05/26881.php
