Hi,

I'm not sure if it's a known issue; possibly it is in v4.0. I'm not sure 
about v4.1 or v5.0 - can you try one of those?
As far as CUDA IPC goes - how are you disabling it? I don't remember the 
mca params in v4.0.
If you disable it through either pml ucx or smcuda, then no, it won't use 
GPUDirect RDMA.
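(If memory serves, the smcuda knobs are along the lines of 
--mca btl_smcuda_use_cuda_ipc 0 and --mca btl_smcuda_use_cuda_ipc_same_gpu 0, 
but please double-check the exact names on your install with 
ompi_info --param btl smcuda --level 9.)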
-Tommy


On Saturday, May 24, 2025 at 8:56:50 AM UTC-7 Mike Adams wrote:

> Hi, I'm using OpenMPI 4.0.5 with CUDA support on PSC Bridges-2.  I'm 
> calling collectives like MPI_Allreduce on buffers that have already been 
> shared between ranks via cudaIpcGetMemHandle/cudaIpcOpenMemHandle.
>
> On these buffers, I get the following message, and communication fails 
> for some message sizes:
>
> --------------------------------------------------------------------------
> The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
> cannot be used.
>   cuIpcGetMemHandle return value:   1
>   address: 0x147d54000068
> Check the cuda.h file for what the return value means. Perhaps a reboot
> of the node will clear the problem.
> --------------------------------------------------------------------------
>
> If I pass in the two mca parameters that disable OpenMPI's CUDA IPC, 
> everything works.
>
> I'm wondering two things:
> 1. Is this failure to handle IPC buffers in OpenMPI 4 a known issue?
> 2. When I disable OpenMPI CUDA IPC with mca parameters, does OpenMPI 
> still use GPUDirect RDMA?
>
> Thanks,
>
> Mike Adams
>
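
For readers landing on this thread later, the pattern described in the 
quoted message boils down to something like the sketch below. This is an 
illustrative reconstruction, not the code from the original report: the 
buffer size, the float/sum reduction, and broadcasting the IPC handle with 
MPI_Bcast are assumptions, and it assumes all ranks run on one node and 
can see the same GPU (CUDA IPC only works between processes on the same 
node), with a CUDA-aware Open MPI build.

/* Sketch: rank 0 allocates a device buffer and exports it via CUDA IPC,
 * the other ranks open it, and then every rank calls MPI_Allreduce on
 * the pointer it ended up with. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define N 1024  /* number of floats; illustrative */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float *shared = NULL;            /* device pointer used by every rank */
    cudaIpcMemHandle_t handle;

    if (rank == 0) {
        /* Rank 0 owns the allocation and exports it. */
        cudaMalloc((void **)&shared, N * sizeof(float));
        cudaMemset(shared, 0, N * sizeof(float));
        cudaIpcGetMemHandle(&handle, shared);
    }

    /* Ship the opaque IPC handle to the other ranks on the node. */
    MPI_Bcast(&handle, sizeof(handle), MPI_BYTE, 0, MPI_COMM_WORLD);

    if (rank != 0) {
        /* Map rank 0's allocation into this process. */
        cudaError_t err = cudaIpcOpenMemHandle((void **)&shared, handle,
                                               cudaIpcMemLazyEnablePeerAccess);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaIpcOpenMemHandle: %s\n",
                    cudaGetErrorString(err));
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
    }

    /* Collective on a device pointer that, on ranks != 0, came from
     * cudaIpcOpenMemHandle rather than cudaMalloc. */
    float *result;
    cudaMalloc((void **)&result, N * sizeof(float));
    MPI_Allreduce(shared, result, N, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    cudaFree(result);
    if (rank != 0)
        cudaIpcCloseMemHandle(shared);
    else
        cudaFree(shared);

    MPI_Finalize();
    return 0;
}

On ranks other than 0, the pointer handed to MPI_Allreduce was obtained 
from cudaIpcOpenMemHandle, which is the situation in which Open MPI's own 
cuIpcGetMemHandle call has been seen to fail with return value 1 
(CUDA_ERROR_INVALID_VALUE), matching the message quoted above.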
