Hello Howard,

This is the output I get from attaching gdb to it from the 2nd host (mpirun --host hades1,hades2 /mnt/cfddata/TestCases/Benchmarks/ompi/examples/hello_c):

gdb /mnt/cfddata/TestCases/Benchmarks/ompi/examples/hello_c 525423
[generic gdb intro text]
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /mnt/cfddata/TestCases/Benchmarks/ompi/examples/hello_c...
Attaching to program: /mnt/cfddata/TestCases/Benchmarks/ompi/examples/hello_c, process 525423
[New LWP 525427]
[New LWP 525426]
[New LWP 525425]
[New LWP 525424]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x000070fffef6b68f in opal_libevent2022_event_base_loop () from /opt/hpcx/ompi/lib/libopen-pal.so.40
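In case a full traceback of every thread is more useful than the single frame above, it could be pulled non-interactively with something along these lines (a sketch; it assumes PID 525423 is still the hung hello_c rank on that host and that batch-mode gdb is acceptable):

# assumes 525423 is the hung hello_c process; attach, dump all thread backtraces, detach
gdb -p 525423 -batch -ex "set pagination off" -ex "thread apply all bt"

Running the same two gdb commands interactively after attaching should give equivalent output.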
Collin Strassburger (he/him)

From: 'Pritchard Jr., Howard' via Open MPI users <[email protected]>
Sent: Tuesday, December 9, 2025 3:27 PM
To: [email protected]
Subject: Re: [EXTERNAL] [OMPI users] Multi-host troubleshooting

Hello Collin,

If you can do it, could you try to ssh into one of the nodes where a hello_c process is running and attach to it with a debugger and get a traceback?

Howard

From: 'Collin Strassburger' via Open MPI users <[email protected]>
Reply-To: [email protected]
Date: Tuesday, December 9, 2025 at 1:19 PM
To: Open MPI Users <[email protected]>
Subject: [EXTERNAL] [OMPI users] Multi-host troubleshooting

Hello,

I am dealing with an odd MPI issue that I am unsure how to continue diagnosing. Following the outline given by https://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems, steps 1-3 complete without any issues, i.e.:

- ssh remotehost hostname works
- Paths include the NVIDIA HPC-X paths when checked both with ssh and mpirun
- mpirun --host node1,node2 hostname works correctly
- mpirun --host node1,node2 env | grep -i path yields identical paths, which include the paths required by HPC-X

(This is all through passwordless login.)

Step 4 calls for running mpirun --hosts node1,node2 hello_c. I have locally compiled the code and confirmed that it works on each machine individually; the same code is shared between the machines. However, it does not run across multiple hosts at once: it simply hangs until Ctrl-C'd. I have attached the --mca plm_base_verbose 10 logs; while I don't see anything in them, I am not well versed enough in Open MPI to think that I understand the full implications of it all.
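For reference, the attached verbose log was gathered with an invocation of roughly this form (a sketch; the binary path, log file name, and tee capture are illustrative rather than the literal command):

# multi-host run with PLM launch verbosity; stdout and stderr captured to a file
mpirun --host node1,node2 --mca plm_base_verbose 10 ./hello_c 2>&1 | tee plm_verbose.log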
Notes:

- No firewall is present between the machines (minimal install is the base, so ufw and iptables are not present by default and have not yet been installed)
- journalctl does not report any errors
- The machines have identical hardware and utilized the same configuration script
- Calling "mpirun --hosts node1,node2 mpirun --version" returns identical results
- Calling "mpirun --hosts node1,node2 env | grep -i path" returns identical results

OS: Ubuntu 24.04 LTS
OMPI: 4.1.7rc1 from NVIDIA HPC-X
Configure options:
  --prefix=${HPCX_HOME}/ompi \
  --with-hcoll=${HPCX_HOME}/hcoll \
  --with-ucx=${HPCX_HOME}/ucx \
  --with-platform=contrib/platform/mellanox/optimized \
  --with-tm=/opt/pbs/ \
  --with-slurm=no \
  --with-pmix \
  --with-hwloc=internal

I'm rather at a loss on what to try/check next. Any thoughts on how to continue troubleshooting this issue?

Warm regards,
Collin Strassburger (he/him)
