Add --mca pml_base_verbose 90 and you should see something like this:

[rock18:3045236] select: component ucx selected
[rock18:3045236] select: component ob1 not selected / finalized

or whatever your OMPI instance selected.
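For example, something like this (reusing the MCA parameters and binary from the command quoted below, trimmed to the relevant flags, purely as an illustration) should print the selected PML for every rank:

  mpirun --mca pml_base_verbose 90 \
         --mca btl_smcuda_use_cuda_ipc_same_gpu 0 --mca btl_smcuda_use_cuda_ipc 0 \
         --map-by ppr:2:numa --bind-to core --rank-by slot \
         ./multilane_ring_allreduce 2>&1 | grep "select:"

If the UCX PML is usable you should see the ucx component selected; otherwise ob1 (which uses the smcuda BTL for CUDA buffers) is the usual fallback in the 4.x series.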
-Tommy

On Tuesday, June 3, 2025 at 12:44:00 PM UTC-5 Mike Adams wrote:

> mpirun --mca btl_smcuda_use_cuda_ipc_same_gpu 0 --mca btl_smcuda_use_cuda_ipc 0 --map-by ppr:2:numa --bind-to core --rank-by slot --display-map --display-allocation --report-bindings ./multilane_ring_allreduce
> where there is 1 GPU per NUMA region.
>
> I am not sure which PML I'm using, but since those parameters cause my program to succeed on OpenMPI 4.0.5 on PSC Bridges-2, I guess it is not UCX. Can you point me in the right direction to determine the PML in use?
>
> Thank you for your assistance!
>
> Mike Adams
>
> On Tuesday, June 3, 2025 at 8:03:16 AM UTC-6 Tomislav Janjusic US wrote:
>
>> Can you post the full mpirun command, or at least the relevant MPI MCA params?
>>
>> "I'm still curious about your input on whether or not those MCA parameters I mentioned yesterday are disabling GPUDirect RDMA as well?"
>> Even if you disable smcuda CUDA IPC, it's possible you're still using CUDA IPC via UCX, for example. The mentioned MCA params disable it for the smcuda BTL, but UCX doesn't use smcuda as a transport, so it's irrelevant for the UCX PML.
>> Do you know which PML you're using?
>> -Tommy
>>
>> On Saturday, May 31, 2025 at 1:26:58 PM UTC-5 Mike Adams wrote:
>>
>>> Interestingly, I made an error - Delta on 4.1.5 did fail like some of the cases on Bridges-2 on 4.0.5, but at 16 ranks per GPU, which is the per-GPU core count of the AMD processor on Delta's 4-GPU nodes. So it looks like Bridges-2 needs an OpenMPI upgrade.
>>>
>>> Tommy, I'm still curious about your input on whether or not those MCA parameters I mentioned yesterday are disabling GPUDirect RDMA as well?
>>>
>>> Thank you both for your help!
>>>
>>> Mike Adams
>>>
>>> On Friday, May 30, 2025 at 11:39:49 AM UTC-6 Mike Adams wrote:
>>>
>>>> Dmitry,
>>>>
>>>> I'm not too familiar with the internals of OpenMPI, but I just tried 4.1.5 on NCSA Delta and received the same IPC errors (no MCA flags switched). The calls didn't actually fail to perform the operation this time, so maybe that's an improvement from v4.0.x to v4.1.x?
>>>>
>>>> Thanks,
>>>>
>>>> Mike Adams
>>>>
>>>> On Friday, May 30, 2025 at 11:21:16 AM UTC-6 Dmitry N. Mikushin wrote:
>>>>
>>>>> There is a relevant explanation of the same issue reported for Julia:
>>>>> https://github.com/JuliaGPU/CUDA.jl/issues/1053
>>>>>
>>>>> Fri, May 30, 2025 at 19:05, Mike Adams <mikeca...@gmail.com>:
>>>>>
>>>>>> Hi Tommy,
>>>>>>
>>>>>> I'm setting btl_smcuda_use_cuda_ipc_same_gpu 0 and btl_smcuda_use_cuda_ipc 0.
>>>>>> So, are you saying that with these params, it is also not using GPUDirect RDMA?
>>>>>>
>>>>>> PSC Bridges-2 only has v4 OpenMPI, but they may be working on installing v5 now. Everything works on v5 on NCSA Delta - I'll try to test on an older OpenMPI.
>>>>>>
>>>>>> Mike Adams
>>>>>>
>>>>>> On Friday, May 30, 2025 at 10:54:23 AM UTC-6 Tomislav Janjusic US wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm not sure if it's a known issue - possibly in v4.0; I'm not sure about v4.1 or v5.0 - can you try?
>>>>>>> As far as CUDA IPC goes - how are you disabling it? I don't remember the MCA params in v4.0.
>>>>>>> If it's disabled either through the UCX PML or smcuda, then no, it won't use it.
>>>>>>> -Tommy
>>>>>>>
>>>>>>> On Saturday, May 24, 2025 at 8:56:50 AM UTC-7 Mike Adams wrote:
>>>>>>>
>>>>>>>> Hi, I'm using OpenMPI 4.0.5 with CUDA support on PSC Bridges-2.
>>>>>>>> I'm calling collectives like MPI_Allreduce on buffers that have already been shared between ranks via cudaIpcGetMemHandle/cudaIpcOpenMemHandle.
>>>>>>>>
>>>>>>>> On these buffers, I receive the following message and some communication sizes fail:
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
>>>>>>>> cannot be used.
>>>>>>>>   cuIpcGetMemHandle return value:   1
>>>>>>>>   address: 0x147d54000068
>>>>>>>> Check the cuda.h file for what the return value means. Perhaps a reboot
>>>>>>>> of the node will clear the problem.
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>
>>>>>>>> If I pass in the two MCA parameters to disable OpenMPI CUDA IPC, everything works.
>>>>>>>>
>>>>>>>> I'm wondering two things:
>>>>>>>> Is this failure to handle IPC buffers in OpenMPI 4 a known issue?
>>>>>>>> When I disable OpenMPI CUDA IPC with MCA parameters, does OpenMPI still use GPUDirect RDMA?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Mike Adams
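For reference, a minimal sketch of the pattern described above - names and sizes are illustrative, error checking is omitted, and it assumes two ranks on one node that can both access the exporting GPU and a CUDA-aware Open MPI build. Rank 0 exports a cudaMalloc'd buffer with cudaIpcGetMemHandle, rank 1 imports it with cudaIpcOpenMemHandle, and the shared pointer is then handed to MPI_Allreduce, which is the step at which the cuIpcGetMemHandle warning quoted above is reported (return value 1 corresponds to CUDA_ERROR_INVALID_VALUE):

/* sketch_ipc_allreduce.c - illustrative only, error checks omitted.
 * Run with: mpirun -np 2 ./sketch_ipc_allreduce (both ranks on one node). */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t n = 1 << 20;          /* element count, illustrative */
    float *dbuf = NULL;                /* buffer shared by both ranks */
    cudaIpcMemHandle_t handle;

    if (rank == 0) {
        /* Rank 0 owns the allocation and exports it. */
        cudaMalloc((void **)&dbuf, n * sizeof(float));
        cudaMemset(dbuf, 0, n * sizeof(float));
        cudaIpcGetMemHandle(&handle, dbuf);
    }

    /* Only the 64-byte IPC handle itself is shipped over MPI. */
    MPI_Bcast(&handle, sizeof(handle), MPI_BYTE, 0, MPI_COMM_WORLD);

    if (rank == 1) {
        /* Rank 1 maps rank 0's allocation into its own address space. */
        cudaIpcOpenMemHandle((void **)&dbuf, handle,
                             cudaIpcMemLazyEnablePeerAccess);
    }

    /* The exported/imported device pointer is then used as the send
     * buffer of a CUDA-aware collective - the step at which the
     * cuIpcGetMemHandle warning quoted above appears. */
    float *result = NULL;
    cudaMalloc((void **)&result, n * sizeof(float));
    MPI_Allreduce(dbuf, result, (int)n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 1) cudaIpcCloseMemHandle(dbuf);
    else           cudaFree(dbuf);
    cudaFree(result);
    MPI_Finalize();
    return 0;
}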