Received from Rolf vandeVaart on Tue, May 19, 2015 at 08:28:46PM EDT:
>
> >-----Original Message-----
> >From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
> >Sent: Tuesday, May 19, 2015 6:30 PM
> >To: us...@open-mpi.org
> >Subject: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI 1.8.5
> >with CUDA 7.0 and Multi-Process Service
> >
> >I'm encountering intermittent errors while trying to use the Multi-Process
> >Service with CUDA 7.0 for improving concurrent access to a Kepler K20Xm GPU
> >by multiple MPI processes that perform GPU-to-GPU communication with each
> >other (i.e., GPU pointers are passed to the MPI transmission primitives).
> >I'm using GitHub revision 41676a1 of mpi4py built against OpenMPI 1.8.5,
> >which is in turn built against CUDA 7.0. In my current configuration, I
> >have 4 MPS server daemons running, each of which controls access to one of
> >4 GPUs; the MPI processes spawned by my program are partitioned into 4
> >groups (which might contain different numbers of processes) that each talk
> >to a separate daemon. For certain transmission patterns between these
> >processes, the program runs without any problems. For others (e.g., 16
> >processes partitioned into 4 groups), however, it dies with the following
> >error:
> >
> >[node05:20562] Failed to register remote memory, rc=-1
> >--------------------------------------------------------------------------
> >The call to cuIpcOpenMemHandle failed. This is an unrecoverable error and
> >will cause the program to abort.
> >  cuIpcOpenMemHandle return value:   21199360
> >  address: 0x1
> >Check the cuda.h file for what the return value means. Perhaps a reboot of
> >the node will clear the problem.
(snip)
> >After the above error occurs, I notice that /dev/shm/ is littered with
> >cuda.shm.* files. I tried cleaning up /dev/shm before running my program,
> >but that doesn't seem to have any effect upon the problem. Rebooting the
> >machine also doesn't have any effect. I should also add that my program
> >runs without any error if the groups of MPI processes talk directly to the
> >GPUs instead of via MPS.
> >
> >Does anyone have any ideas as to what could be going on?
>
> I am not sure why you are seeing this. One thing that is clear is that you
> have found a bug in the error reporting. The error message is a little
> garbled and I see a bug in what we are reporting. I will fix that.
>
> If possible, could you try running with --mca btl_smcuda_use_cuda_ipc 0. My
> expectation is that you will not see any errors, but may lose some
> performance.
>
> What does your hardware configuration look like? Can you send me output
> from "nvidia-smi topo -m"

        GPU0    GPU1    GPU2    GPU3    CPU Affinity
GPU0     X      PHB     SOC     SOC     0-23
GPU1    PHB      X      SOC     SOC     0-23
GPU2    SOC     SOC      X      PHB     0-23
GPU3    SOC     SOC     PHB      X      0-23

Legend:

  X   = Self
  SOC = Path traverses a socket-level link (e.g. QPI)
  PHB = Path traverses a PCIe host bridge
  PXB = Path traverses multiple PCIe internal switches
  PIX = Path traverses a PCIe internal switch

--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
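
For anyone trying to reproduce the failing pattern, the GPU-to-GPU
communication described in the original report boils down to handing device
buffers straight to the MPI calls. The sketch below illustrates only that
pattern; it assumes CuPy for the device allocations and a simple two-rank
exchange, neither of which is taken from the actual program.

from mpi4py import MPI
import cupy as cp  # assumption: any array exposing __cuda_array_interface__ would do

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n = 1 << 20
if rank == 0:
    sbuf = cp.arange(n, dtype=cp.float32)        # buffer allocated in GPU memory
    cp.cuda.get_current_stream().synchronize()   # make sure the data is ready
    comm.Send([sbuf, MPI.FLOAT], dest=1, tag=0)  # device pointer handed to MPI directly
elif rank == 1:
    rbuf = cp.empty(n, dtype=cp.float32)
    comm.Recv([rbuf, MPI.FLOAT], source=0, tag=0)
    # With a CUDA-aware Open MPI build the data is delivered straight into
    # GPU memory; on a single node the smcuda BTL can use CUDA IPC for this,
    # which is where the failing cuIpcOpenMemHandle call above comes from.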
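
Rolf's suggested test corresponds to adding "--mca btl_smcuda_use_cuda_ipc 0"
to the mpirun command line. Open MPI also reads MCA parameters from
OMPI_MCA_* environment variables, so a rough equivalent (a sketch, not
something taken from this thread) is to set the parameter in the environment
before MPI is initialized, i.e. before mpi4py is imported:

import os

# Disable CUDA IPC in the smcuda BTL; the parameter name comes from Rolf's
# reply, and the OMPI_MCA_<name> environment-variable form is Open MPI's
# equivalent of passing --mca on the command line.
os.environ["OMPI_MCA_btl_smcuda_use_cuda_ipc"] = "0"

# Initialize MPI only after the variable is set.
from mpi4py import MPI

comm = MPI.COMM_WORLD
if comm.Get_rank() == 0:
    print("CUDA IPC disabled; transfers should still work, possibly with "
          "lower intra-node GPU-to-GPU performance.")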