Re: [EXTERNAL] [OMPI users] Multi-host troubleshooting

2025-12-09 Thread 'George Bosilca' via Open MPI users
You could try running with `-x UCC_LOG_LEVEL=info` (add this to your mpirun command) to get additional info from the UCC initialization steps. However, your initial configuration parameters for Open MPI do not indicate it was built with UCC support. Where did you find the configure options? G
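
For anyone reproducing this, a rough sketch of what that invocation could look like, reusing the host names and hello_c path that appear later in this thread (those are Collin's, not a general recipe):

  mpirun --host hades1,hades2 -x UCC_LOG_LEVEL=info \
      /mnt/cfddata/TestCases/Benchmarks/ompi/examples/hello_c

The -x flag exports the named environment variable to the launched processes, so the UCC library on each host prints its initialization steps at info level.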

RE: [EXTERNAL] [OMPI users] Multi-host troubleshooting

2025-12-09 Thread 'Collin Strassburger' via Open MPI users
Hello Howard, Thanks for the info! I’ll look into getting in touch with the groups you mentioned 😊 Warm regards, Collin Strassburger (he/him)

Re: [EXTERNAL] [OMPI users] Multi-host troubleshooting

2025-12-09 Thread 'Pritchard Jr., Howard' via Open MPI users
Hi Collin, Well, I would hope that at scale (10s of nodes) UCC would provide some benefit. I’d suggest getting in touch with someone on the Nvidia payroll to figure out what may be going on with UCC initialization on your system. Or there’s a UCX mailing list that has a UCC WG community that may b

RE: [EXTERNAL] [OMPI users] Multi-host troubleshooting

2025-12-09 Thread 'Collin Strassburger' via Open MPI users
Hello Howard, Running with export OMPI_MCA_coll=^ucc resulted in a working run of the code! Are there any downsides to using OMPI_MCA_coll=^ucc to side-step this issue? Warm regards, Collin Strassburger (he/him)
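
One way to see which collective components a given Open MPI build actually provides (and therefore what remains in use once ucc is excluded) is ompi_info; a quick sketch, assuming the Open MPI bin directory is on the PATH:

  ompi_info | grep -i " coll:"
  ompi_info --param coll all --level 9

The first command lists the coll components compiled into the build; the second prints their MCA parameters, which also shows whether a ucc component is present at all.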

Re: [EXTERNAL] [OMPI users] Multi-host troubleshooting

2025-12-09 Thread 'Pritchard Jr., Howard' via Open MPI users
Hi Collin, This is much more helpful. Let’s first try to turn off “optimizations”. Could you rerun with the following MCA param set: export OMPI_MCA_coll=^ucc and see if that helps? Also, this points to possible problems with your system’s IB network setup. Howard
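
A short sketch of the two equivalent ways this parameter is typically set, with the host names and example path borrowed from Collin's messages elsewhere in this thread:

  export OMPI_MCA_coll=^ucc
  mpirun --host hades1,hades2 /mnt/cfddata/TestCases/Benchmarks/ompi/examples/hello_c

  # or, scoped to a single invocation
  mpirun --mca coll ^ucc --host hades1,hades2 /mnt/cfddata/TestCases/Benchmarks/ompi/examples/hello_c

The leading ^ means "exclude": every coll component except ucc stays eligible, so Open MPI falls back to its other collective implementations.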

RE: [EXTERNAL] [OMPI users] Multi-host troubleshooting

2025-12-09 Thread 'Collin Strassburger' via Open MPI users
Hit “enter” a little too soon. Here’s the rest that was intended to be included: (gdb) bt full #0 __GI___pthread_mutex_unlock_usercnt (decr=1, mutex=<optimized out>) at ./nptl/pthread_mutex_unlock.c:72 type = <optimized out> type = <optimized out> __PRETTY_FUNCTION__ = "__pthread_mutex_unlock_usercnt" __valu

RE: [EXTERNAL] [OMPI users] Multi-host troubleshooting

2025-12-09 Thread 'Collin Strassburger' via Open MPI users
Hello Howard, This is the output I get from attaching gdb to it on the 2nd host (mpirun --host hades1,hades2 /mnt/cfddata/TestCases/Benchmarks/ompi/examples/hello_c): gdb /mnt/cfddata/TestCases/Benchmarks/ompi/examples/hello_c 525423 [generic gdb intro text] For help, type "help". Type "apropo

Re: [EXTERNAL] [OMPI users] Multi-host troubleshooting

2025-12-09 Thread 'Pritchard Jr., Howard' via Open MPI users
Hello Collin, If you can do it, could you try to ssh into one of the nodes where a hello_c process is running and attach to it with a debugger and get a traceback? Howard
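
A rough sketch of that attach-and-traceback step, with the PID as a placeholder and the second host name taken from Collin's reply:

  ssh hades2
  ps aux | grep hello_c        # find the PID of the stuck hello_c process
  gdb -p <PID>
  (gdb) thread apply all bt full
  (gdb) detach
  (gdb) quit

gdb -p attaches to the already-running process, and "thread apply all bt full" captures a backtrace with locals for every thread, which is usually more informative for a hang than a single-thread bt.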

[OMPI users] Multi-host troubleshooting

2025-12-09 Thread 'Collin Strassburger' via Open MPI users
Hello, I am dealing with an odd MPI issue that I am unsure how to continue diagnosing. Following the outline given by https://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems, steps 1-3 complete without any issues, i.e. ssh remotehost hostname works. Paths include the nvidia h
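
For completeness, a sketch of the kind of checks that FAQ entry walks through, using the host names mentioned in this thread (hades1 and hades2); each step should succeed before moving on to the next:

  ssh hades2 hostname                                    # non-interactive ssh to the remote host
  mpirun --host hades1,hades2 hostname                   # launch a non-MPI program on both hosts
  mpirun --host hades1,hades2 /mnt/cfddata/TestCases/Benchmarks/ompi/examples/hello_c   # then a real MPI program

The problem described in this thread appears only at the MPI-launch step, which is what points the later discussion toward the UCC collective component rather than ssh or PATH setup.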