Add --mca pml_base_verbose 90 and you should see something like this:

[rock18:3045236] select: component ucx selected
[rock18:3045236] select: component ob1 not selected / finalized

or whatever your OMPI instance selected.
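For example, something like this (reusing the MCA parameters and binary from the command quoted below, trimmed to the relevant flags, purely as an illustration) should print the selected PML for every rank:

  mpirun --mca pml_base_verbose 90 \
         --mca btl_smcuda_use_cuda_ipc_same_gpu 0 --mca btl_smcuda_use_cuda_ipc 0 \
         --map-by ppr:2:numa --bind-to core --rank-by slot \
         ./multilane_ring_allreduce 2>&1 | grep "select:"

If the UCX PML is usable you should see the ucx component selected; otherwise ob1 (which uses the smcuda BTL for CUDA buffers) is the usual fallback in the 4.x series.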
-Tommy

On Tuesday, June 3, 2025 at 12:44:00 PM UTC-5 Mike Adams wrote:

> mpirun --mca btl_smcuda_use_cuda_ipc_same_gpu 0 --mca btl_smcuda_use_cuda_ipc 0 --map-by ppr:2:numa --bind-to core --rank-by slot --display-map --display-allocation --report-bindings ./multilane_ring_allreduce
> where there is 1 GPU per NUMA region.
>
> I am not sure which PML I'm using, but since those parameters cause my program to succeed on OpenMPI 4.0.5 on PSC Bridges-2, I guess it is not UCX. Can you point me in the right direction to determine the PML in use?
>
> Thank you for your assistance!
>
> Mike Adams
>
> On Tuesday, June 3, 2025 at 8:03:16 AM UTC-6 Tomislav Janjusic US wrote:
>
>> Can you post the full mpirun command, or at least the relevant MPI MCA params?
>>
>> "I'm still curious about your input on whether or not those MCA parameters I mentioned yesterday are disabling GPUDirect RDMA as well?"
>> Even if you disable smcuda CUDA IPC, it's possible you're still using CUDA IPC via UCX, for example. The mentioned MCA params disable it for the smcuda BTL, but UCX doesn't use smcuda as a transport, so it's irrelevant for the UCX PML.
>> Do you know which PML you're using?
>> -Tommy
>>
>> On Saturday, May 31, 2025 at 1:26:58 PM UTC-5 Mike Adams wrote:
>>
>>> Interestingly, I made an error - Delta on 4.1.5 did fail like some of the cases on Bridges-2 on 4.0.5, but at 16 ranks per GPU, which is the per-GPU core count of the AMD processor on Delta's 4-GPU nodes. So it looks like Bridges-2 needs an OpenMPI upgrade.
>>>
>>> Tommy, I'm still curious about your input on whether or not those MCA parameters I mentioned yesterday are disabling GPUDirect RDMA as well?
>>>
>>> Thank you both for your help!
>>>
>>> Mike Adams
>>>
>>> On Friday, May 30, 2025 at 11:39:49 AM UTC-6 Mike Adams wrote:
>>>
>>>> Dmitry,
>>>>
>>>> I'm not too familiar with the internals of OpenMPI, but I just tried 4.1.5 on NCSA Delta and received the same IPC errors (no MCA flags switched). The calls didn't actually fail to perform the operation this time, so maybe that's an improvement from v4.0.x to v4.1.x?
>>>>
>>>> Thanks,
>>>>
>>>> Mike Adams
>>>>
>>>> On Friday, May 30, 2025 at 11:21:16 AM UTC-6 Dmitry N. Mikushin wrote:
>>>>
>>>>> There is a relevant explanation of the same issue reported for Julia:
>>>>> https://github.com/JuliaGPU/CUDA.jl/issues/1053
>>>>>
>>>>> Fri, May 30, 2025 at 19:05, Mike Adams <mikeca...@gmail.com>:
>>>>>
>>>>>> Hi Tommy,
>>>>>>
>>>>>> I'm setting btl_smcuda_use_cuda_ipc_same_gpu 0 and btl_smcuda_use_cuda_ipc 0.
>>>>>> So, are you saying that with these params, it is also not using GPUDirect RDMA?
>>>>>>
>>>>>> PSC Bridges-2 only has v4 OpenMPI, but they may be working on installing v5 now. Everything works on v5 on NCSA Delta - I'll try to test on an older OpenMPI.
>>>>>>
>>>>>> Mike Adams
>>>>>>
>>>>>> On Friday, May 30, 2025 at 10:54:23 AM UTC-6 Tomislav Janjusic US wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm not sure if it's a known issue - possibly in v4.0; I'm not sure about v4.1 or v5.0 - can you try?
>>>>>>> As far as CUDA IPC goes - how are you disabling it? I don't remember the MCA params in v4.0.
>>>>>>> If it's disabled either through the UCX PML or smcuda, then no, it won't use it.
>>>>>>> -Tommy
>>>>>>>
>>>>>>> On Saturday, May 24, 2025 at 8:56:50 AM UTC-7 Mike Adams wrote:
>>>>>>>
>>>>>>>> Hi, I'm using OpenMPI 4.0.5 with CUDA support on PSC Bridges-2.
>>>>>>>> I'm calling collectives like MPI_Allreduce on buffers that have already been shared between ranks via cudaIpcGetMemHandle/cudaIpcOpenMemHandle.
>>>>>>>>
>>>>>>>> On these buffers, I receive the following message and some communication sizes fail:
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
>>>>>>>> cannot be used.
>>>>>>>>   cuIpcGetMemHandle return value:   1
>>>>>>>>   address: 0x147d54000068
>>>>>>>> Check the cuda.h file for what the return value means. Perhaps a reboot
>>>>>>>> of the node will clear the problem.
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>
>>>>>>>> If I pass in the two MCA parameters to disable OpenMPI CUDA IPC, everything works.
>>>>>>>>
>>>>>>>> I'm wondering two things:
>>>>>>>> Is this failure to handle IPC buffers in OpenMPI 4 a known issue?
>>>>>>>> When I disable OpenMPI CUDA IPC with MCA parameters, does OpenMPI still use GPUDirect RDMA?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Mike Adams
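For reference, a minimal sketch of the pattern described above - names and sizes are illustrative, error checking is omitted, and it assumes two ranks on one node that can both access the exporting GPU and a CUDA-aware Open MPI build. Rank 0 exports a cudaMalloc'd buffer with cudaIpcGetMemHandle, rank 1 imports it with cudaIpcOpenMemHandle, and the shared pointer is then handed to MPI_Allreduce, which is the step at which the cuIpcGetMemHandle warning quoted above is reported (return value 1 corresponds to CUDA_ERROR_INVALID_VALUE):

/* sketch_ipc_allreduce.c - illustrative only, error checks omitted.
 * Run with: mpirun -np 2 ./sketch_ipc_allreduce (both ranks on one node). */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t n = 1 << 20;          /* element count, illustrative */
    float *dbuf = NULL;                /* buffer shared by both ranks */
    cudaIpcMemHandle_t handle;

    if (rank == 0) {
        /* Rank 0 owns the allocation and exports it. */
        cudaMalloc((void **)&dbuf, n * sizeof(float));
        cudaMemset(dbuf, 0, n * sizeof(float));
        cudaIpcGetMemHandle(&handle, dbuf);
    }

    /* Only the 64-byte IPC handle itself is shipped over MPI. */
    MPI_Bcast(&handle, sizeof(handle), MPI_BYTE, 0, MPI_COMM_WORLD);

    if (rank == 1) {
        /* Rank 1 maps rank 0's allocation into its own address space. */
        cudaIpcOpenMemHandle((void **)&dbuf, handle,
                             cudaIpcMemLazyEnablePeerAccess);
    }

    /* The exported/imported device pointer is then used as the send
     * buffer of a CUDA-aware collective - the step at which the
     * cuIpcGetMemHandle warning quoted above appears. */
    float *result = NULL;
    cudaMalloc((void **)&result, n * sizeof(float));
    MPI_Allreduce(dbuf, result, (int)n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 1) cudaIpcCloseMemHandle(dbuf);
    else           cudaFree(dbuf);
    cudaFree(result);
    MPI_Finalize();
    return 0;
}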