Dmitry, 

I'm not too familiar with the internals of OpenMPI, but I just tried 4.1.5 
on NCSA Delta and received the same IPC errors (no mca flags switched). 
This time, though, the calls did not fail to perform the actual operation, 
so maybe that's an improvement from v4.0.x to v4.1.x?

Thanks,

Mike Adams

On Friday, May 30, 2025 at 11:21:16 AM UTC-6 Dmitry N. Mikushin wrote:

> There is a relevant explanation of the same issue reported for Julia: 
> https://github.com/JuliaGPU/CUDA.jl/issues/1053
>
> Fri, May 30, 2025 at 19:05, Mike Adams <mikeca...@gmail.com>:
>
>> Hi Tommy,
>>
>> I'm setting btl_smcuda_use_cuda_ipc_same_gpu 0 and 
>> btl_smcuda_use_cuda_ipc 0. 
>> So, are you saying that with these params, it is also not using GPUDirect 
>> RDMA?
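>>
>> For reference, passing those two params the usual way on the mpirun 
>> command line would look roughly like this (the rank count and 
>> executable name here are just placeholders):
>>
>>   mpirun -np 2 --mca btl_smcuda_use_cuda_ipc 0 \
>>       --mca btl_smcuda_use_cuda_ipc_same_gpu 0 ./my_app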
>>
>> PSC Bridges 2 only has v4 OpenMPI, but they may be working on installing 
>> v5 now.  Everything works on v5 on NCSA Delta - I'll try to test on an 
>> older OpenMPI.
>>
>> Mike Adams
>> On Friday, May 30, 2025 at 10:54:23 AM UTC-6 Tomislav Janjusic US wrote:
>>
>>> Hi,
>>>
>>> I'm not sure if it's a known issue; possibly in v4.0, and I'm not sure 
>>> about v4.1 or v5.0 - can you try?
>>> As for CUDA IPC - how are you disabling it? I don't remember the mca 
>>> params in v4.0.
>>> If it's disabled through either pml ucx or smcuda, then no, it won't use it.
>>> -Tommy
>>>
>>>
>>> On Saturday, May 24, 2025 at 8:56:50 AM UTC-7 Mike Adams wrote:
>>>
>>>> Hi, I'm using OpenMPI 4.0.5 with CUDA support on PSC Bridges-2.  I'm 
>>>> calling collectives like MPI_Allreduce on buffers that have already been 
>>>> shared between ranks via cudaIpcGetMemHandle/cudaIpcOpenMemHandle.
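>>>>
>>>> Roughly, the setup looks like the sketch below (not the actual code: 
>>>> error checking is omitted, the buffer size and two-ranks-on-one-node 
>>>> layout are placeholders, and it assumes a CUDA-aware OpenMPI build):
>>>>
>>>>   #include <mpi.h>
>>>>   #include <cuda_runtime.h>
>>>>
>>>>   /* Sketch: run with exactly 2 ranks on one node. */
>>>>   int main(int argc, char **argv) {
>>>>       MPI_Init(&argc, &argv);
>>>>       int rank;
>>>>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>       const int N = 1024;              /* placeholder size */
>>>>       int peer = 1 - rank;
>>>>
>>>>       /* Each rank allocates its own device buffer... */
>>>>       float *dbuf = NULL;
>>>>       cudaMalloc((void **)&dbuf, N * sizeof(float));
>>>>       cudaMemset(dbuf, 0, N * sizeof(float));
>>>>
>>>>       /* ...exports it and opens the peer's copy, so the buffers have
>>>>          already been shared via CUDA IPC before any collective runs. */
>>>>       cudaIpcMemHandle_t mine, theirs;
>>>>       cudaIpcGetMemHandle(&mine, dbuf);
>>>>       MPI_Sendrecv(&mine, (int)sizeof(mine), MPI_BYTE, peer, 0,
>>>>                    &theirs, (int)sizeof(theirs), MPI_BYTE, peer, 0,
>>>>                    MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>>>       float *peer_buf = NULL;
>>>>       cudaIpcOpenMemHandle((void **)&peer_buf, theirs,
>>>>                            cudaIpcMemLazyEnablePeerAccess);
>>>>
>>>>       /* Collective directly on the device buffer that has already been
>>>>          through cudaIpcGetMemHandle; this is the kind of call where the
>>>>          warning below shows up for me. */
>>>>       MPI_Allreduce(MPI_IN_PLACE, dbuf, N, MPI_FLOAT, MPI_SUM,
>>>>                     MPI_COMM_WORLD);
>>>>
>>>>       cudaIpcCloseMemHandle(peer_buf);
>>>>       MPI_Finalize();
>>>>       return 0;
>>>>   }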
>>>>
>>>> On these buffers, I receive the following message, and the communication 
>>>> fails for some sizes:
>>>>
>>>>
>>>> --------------------------------------------------------------------------
>>>> The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
>>>> cannot be used.
>>>>   cuIpcGetMemHandle return value:   1
>>>>   address: 0x147d54000068
>>>> Check the cuda.h file for what the return value means. Perhaps a reboot
>>>> of the node will clear the problem.
>>>>
>>>> --------------------------------------------------------------------------
>>>>
>>>> If I pass in the two mca parameters to disable OpenMPI IPC, everything 
>>>> works.
>>>>
>>>> I'm wondering two things:
>>>> 1. Is this failure to handle IPC buffers in OpenMPI 4 a known issue?
>>>> 2. When I disable OpenMPI CUDA IPC with mca parameters, does OpenMPI still 
>>>> use GPUDirect RDMA?
>>>>
>>>> Thanks,
>>>>
>>>> Mike Adams
>>>>
>>
>
