-----Original Message-----
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
>Sent: Tuesday, May 19, 2015 10:25 PM
>To: Open MPI Users
>Subject: Re: [OMPI users] cuIpcOpenMemHandle failure when using
>OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service
>
>Received from Rolf vandeVaart on Tue, May 19, 2015 at 08:28:46PM EDT:
>> >-----Original Message-----
>> >From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev
>> >Givon
>> >Sent: Tuesday, May 19, 2015 6:30 PM
>> >To: us...@open-mpi.org
>> >Subject: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI
>> >1.8.5 with CUDA 7.0 and Multi-Process Service
>> >
>> >I'm encountering intermittent errors while trying to use the
>> >Multi-Process Service with CUDA 7.0 for improving concurrent access
>> >to a Kepler K20Xm GPU by multiple MPI processes that perform
>> >GPU-to-GPU communication with each other (i.e., GPU pointers are
>> >passed to the MPI transmission primitives).
>> >I'm using GitHub revision 41676a1 of mpi4py built against OpenMPI
>> >1.8.5, which is in turn built against CUDA 7.0. In my current
>> >configuration, I have 4 MPS server daemons running, each of which
>> >controls access to one of 4 GPUs; the MPI processes spawned by my
>> >program are partitioned into 4 groups (which might contain different
>> >numbers of processes) that each talk to a separate daemon. For
>> >certain transmission patterns between these processes, the program
>> >runs without any problems. For others (e.g., 16 processes partitioned
>> >into 4 groups), however, it dies with the following error:
>> >
>> >[node05:20562] Failed to register remote memory, rc=-1
>> >--------------------------------------------------------------------------
>> >The call to cuIpcOpenMemHandle failed. This is an unrecoverable error
>> >and will cause the program to abort.
>> >  cuIpcOpenMemHandle return value:   21199360
>> >  address: 0x1
>> >Check the cuda.h file for what the return value means. Perhaps a
>> >reboot of the node will clear the problem.
>
>(snip)
>
>> >After the above error occurs, I notice that /dev/shm/ is littered
>> >with
>> >cuda.shm.* files. I tried cleaning up /dev/shm before running my
>> >program, but that doesn't seem to have any effect upon the problem.
>> >Rebooting the machine also doesn't have any effect. I should also add
>> >that my program runs without any error if the groups of MPI processes
>> >talk directly to the GPUs instead of via MPS.
>> >
>> >Does anyone have any ideas as to what could be going on?
>>
>> I am not sure why you are seeing this.  One thing that is clear is
>> that you have found a bug in the error reporting: the message above is
>> a little garbled, and what we report in it is incorrect.  I will fix
>> that.
>>
>> If possible, could you try running with "--mca btl_smcuda_use_cuda_ipc
>> 0"?  My expectation is that you will not see any errors, but you may
>> lose some performance.
>
>The error does indeed go away when IPC is disabled, although I do want to
>avoid degrading the performance of data transfers between GPU memory
>locations.
>
>> What does your hardware configuration look like?  Can you send me the
>> output from "nvidia-smi topo -m"?
>--
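
For reference, a run with CUDA IPC disabled would look something like the
following (a minimal sketch; the script name and process count are
placeholders, not taken from your setup):

    # Turn off CUDA IPC in the smcuda BTL; device-to-device messages are
    # then staged through host buffers, so some bandwidth is lost.
    mpirun --mca btl_smcuda_use_cuda_ipc 0 -np 16 python my_program.py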

I see that you mentioned you are starting 4 MPS daemons.  Are you following the 
instructions here?

http://cudamusing.blogspot.de/2013/07/enabling-cuda-multi-process-service-mps.html

That approach relies on setting CUDA_VISIBLE_DEVICES, which can cause 
problems for CUDA IPC.  Since you are using CUDA 7, there is no longer any 
need to start multiple daemons: leave CUDA_VISIBLE_DEVICES untouched and 
start a single MPS control daemon, which will handle all of the GPUs.  Can 
you try that?  Your question also made us realize that we need to update 
our documentation.
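
Concretely, the CUDA 7 setup would look something like this (a sketch only;
if your old daemons use per-GPU pipe directories as in the blog post, point
CUDA_MPS_PIPE_DIRECTORY at each one before issuing the quit):

    # Shut down the old per-GPU control daemons (repeat once per daemon,
    # with CUDA_MPS_PIPE_DIRECTORY set to that daemon's pipe directory).
    echo quit | nvidia-cuda-mps-control

    # Leave CUDA_VISIBLE_DEVICES unset and start one control daemon;
    # under CUDA 7 it manages all of the GPUs by itself.
    unset CUDA_VISIBLE_DEVICES
    nvidia-cuda-mps-control -d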

Thanks,
Rolf

