Hi, I'm using OpenMPI 4.0.5 with CUDA support on PSC Bridges-2.  I'm 
calling collectives like MPI_Allreduce on buffers that have already been 
shared between ranks via cudaIpcGetMemHandle/cudaIpcOpenMemHandle.
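
For clarity, here is a minimal sketch of the pattern I mean (two ranks on one node, error checking omitted; this is not my actual code, and the sizes/names are placeholders):

    /* Rank 0 allocates a device buffer and exports it; rank 1 maps it via
     * CUDA IPC; both ranks then hand device pointers to MPI_Allreduce. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;          /* element count (placeholder) */
        float *shared = NULL;           /* the IPC-shared device buffer */
        cudaIpcMemHandle_t handle;

        if (rank == 0) {
            cudaMalloc((void **)&shared, n * sizeof(float));
            cudaIpcGetMemHandle(&handle, shared);
        }
        /* Ship the opaque handle to the other rank. */
        MPI_Bcast(&handle, (int)sizeof(handle), MPI_BYTE, 0, MPI_COMM_WORLD);
        if (rank != 0) {
            /* Map rank 0's allocation into this process's address space. */
            cudaIpcOpenMemHandle((void **)&shared, handle,
                                 cudaIpcMemLazyEnablePeerAccess);
        }

        float *result = NULL;
        cudaMalloc((void **)&result, n * sizeof(float));

        /* The collective is called on the IPC-mapped device pointer
         * (buffer left uninitialized; this only shows the call pattern). */
        MPI_Allreduce(shared, result, n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

        if (rank != 0) cudaIpcCloseMemHandle(shared);
        cudaFree(result);
        if (rank == 0) cudaFree(shared);
        MPI_Finalize();
        return 0;
    }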

When the collective operates on these buffers, I get the following warning and 
communication fails at some message sizes:

--------------------------------------------------------------------------
The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
cannot be used.
  cuIpcGetMemHandle return value:   1
  address: 0x147d54000068
Check the cuda.h file for what the return value means. Perhaps a reboot
of the node will clear the problem.
--------------------------------------------------------------------------

If I pass in the two MCA parameters that disable OpenMPI's CUDA IPC support, 
everything works.
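
If I'm naming them correctly, those are the smcuda BTL's CUDA IPC settings, so the run that works looks roughly like this (the binary name is a placeholder):

    mpirun --mca btl_smcuda_use_cuda_ipc 0 --mca btl_smcuda_use_cuda_ipc_same_gpu 0 ./my_app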

I'm wondering two things:
1. Is this failure to handle IPC-mapped buffers in OpenMPI 4 a known issue?
2. When OpenMPI's CUDA IPC is disabled via those MCA parameters, does OpenMPI 
   still use GPUDirect RDMA?

Thanks,

Mike Adams
