Hi, I'm using OpenMPI 4.0.5 with CUDA support on PSC Bridges-2. I'm calling collectives like MPI_Allreduce on buffers that have already been shared between ranks via cudaIpcGetMemHandle/cudaIpcOpenMemHandle.
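In case it helps, here is a stripped-down sketch of the pattern (not the actual application; error checking, cleanup, and real data initialization are omitted, and it assumes exactly two ranks on one node):

/* Minimal sketch: rank 0 exports a device buffer via CUDA IPC, rank 1
 * maps it, and both ranks then call MPI_Allreduce on device pointers.
 * Simplified from the real code; error checking and cleanup omitted. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {                 /* sketch assumes two ranks on the same node */
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    const size_t n = 1 << 20;
    float *sendbuf = NULL, *recvbuf = NULL;
    cudaIpcMemHandle_t handle;

    if (rank == 0) {
        cudaMalloc((void **)&sendbuf, n * sizeof(float));
        cudaMemset(sendbuf, 0, n * sizeof(float));
        /* Export the allocation so the peer rank can map it. */
        cudaIpcGetMemHandle(&handle, sendbuf);
        MPI_Send(&handle, (int)sizeof(handle), MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(&handle, (int)sizeof(handle), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        /* Map rank 0's allocation into this process's address space. */
        cudaIpcOpenMemHandle((void **)&sendbuf, handle,
                             cudaIpcMemLazyEnablePeerAccess);
    }

    cudaMalloc((void **)&recvbuf, n * sizeof(float));

    /* The collective on the IPC-shared device buffer is where the
     * cuIpcGetMemHandle failure shows up for some message sizes. */
    MPI_Allreduce(sendbuf, recvbuf, (int)n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}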
On these buffers, some communication sizes fail and I receive the following message:

--------------------------------------------------------------------------
The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
cannot be used.
  cuIpcGetMemHandle return value:   1
  address: 0x147d54000068
Check the cuda.h file for what the return value means. Perhaps a reboot
of the node will clear the problem.
--------------------------------------------------------------------------

If I pass in the two mca parameters that disable OpenMPI's CUDA IPC, everything works. I'm wondering two things:

1. Is this failure to handle IPC buffers a known issue in OpenMPI 4?
2. When I disable OpenMPI's CUDA IPC with the mca parameters, does OpenMPI still use GPUDirect RDMA?

Thanks,
Mike Adams
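P.S. For concreteness, what I mean by disabling OpenMPI's CUDA IPC is turning off the smcuda BTL's IPC path, along the lines of the following (parameter names from memory, so the exact spelling may differ; ./app is a placeholder):

    mpirun --mca btl_smcuda_use_cuda_ipc 0 --mca btl_smcuda_use_cuda_ipc_same_gpu 0 ./app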