Hi, I'm not sure whether it's a known issue; possibly in v4.0. I don't know about v4.1 or v5.0 - can you try one of those? As for CUDA IPC: how are you disabling it? I don't remember the MCA params in v4.0, but if you're disabling it either through pml ucx or through smcuda, then no, it won't use GPUDirect RDMA.
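For reference, if you're going through the smcuda BTL, I believe the knob was something like the following (I'm going from memory on the exact 4.x parameter name, so please double-check it against `ompi_info --all` on Bridges-2; `your_app` is just a placeholder):

    mpirun --mca btl_smcuda_use_cuda_ipc 0 ... ./your_app

With pml ucx it's different: CUDA IPC is handled inside UCX itself, so it would be UCX's own transport selection (e.g. the UCX_TLS environment variable) rather than an Open MPI MCA parameter that controls it.

-Tommy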
On Saturday, May 24, 2025 at 8:56:50 AM UTC-7 Mike Adams wrote:

> Hi, I'm using OpenMPI 4.0.5 with CUDA support on PSC Bridges-2. I'm
> calling collectives like MPI_Allreduce on buffers that have already been
> shared between ranks via cudaIpcGetMemHandle/cudaIpcOpenMemHandle.
>
> On these buffers, I receive the following message and some communication
> sizes fail:
>
> --------------------------------------------------------------------------
> The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
> cannot be used.
> cuIpcGetMemHandle return value: 1
> address: 0x147d54000068
> Check the cuda.h file for what the return value means. Perhaps a reboot
> of the node will clear the problem.
> --------------------------------------------------------------------------
>
> If I pass in the two mca parameters to disable OpenMPI IPC, everything
> works.
>
> I'm wondering two things:
> 1. Is this failure to handle IPC buffers in OpenMPI 4 a known issue?
> 2. When I disable OpenMPI CUDA IPC with mca parameters, does OpenMPI
>    still use GPUDirect RDMA?
>
> Thanks,
>
> Mike Adams
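In case it helps anyone reproduce this, a minimal sketch of the setup Mike describes might look roughly like the code below. This is just an illustration of the reported pattern, not his actual code: all ranks are assumed to share one node (and, for simplicity, one GPU), and error checking and buffer initialization are omitted.

    /* Rank 0 allocates a device buffer and exports it with cudaIpcGetMemHandle;
     * the other ranks map it with cudaIpcOpenMemHandle; then every rank calls
     * MPI_Allreduce on device pointers (requires a CUDA-aware Open MPI). */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;          /* number of floats in the shared buffer */
        float *shared = NULL;           /* device pointer to the IPC-shared allocation */
        cudaIpcMemHandle_t handle;

        if (rank == 0) {
            cudaMalloc((void **)&shared, n * sizeof(float));
            cudaIpcGetMemHandle(&handle, shared);
        }

        /* Ship the raw IPC handle to the other ranks on the node. */
        MPI_Bcast(&handle, sizeof(handle), MPI_BYTE, 0, MPI_COMM_WORLD);

        if (rank != 0) {
            cudaIpcOpenMemHandle((void **)&shared, handle,
                                 cudaIpcMemLazyEnablePeerAccess);
        }

        /* A separate, ordinary device buffer for the reduction result. */
        float *result = NULL;
        cudaMalloc((void **)&result, n * sizeof(float));

        /* The collective on the IPC-shared device buffer -- the call that
         * reportedly triggers the cuIpcGetMemHandle error inside Open MPI. */
        MPI_Allreduce(shared, result, n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

        /* Don't free the exporting rank's buffer while others still map it. */
        MPI_Barrier(MPI_COMM_WORLD);
        if (rank != 0)
            cudaIpcCloseMemHandle(shared);
        else
            cudaFree(shared);
        cudaFree(result);

        MPI_Finalize();
        return 0;
    }

Built with a CUDA-aware Open MPI's mpicc and linked against the CUDA runtime (plus whatever include/library paths your CUDA install needs), running two ranks on one node should exercise the same code path.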