Hello,
I am dealing with an odd MPI issue that I am unsure how to continue diagnosing.
Following the outline at
https://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems,
steps 1-3 complete without any issues (all over passwordless ssh): ssh to the
remote host works, the paths include the NVIDIA HPC-X paths when checked both
via ssh and via mpirun, and the cross-host hostname and env checks behave as
expected, as shown below.
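Concretely (node names generic):

  ssh node2 hostname                             # works
  ssh node2 env | grep -i path                   # includes the HPC-X paths
  mpirun --host node1,node2 hostname             # works correctly
  mpirun --host node1,node2 env | grep -i path   # identical paths, including HPC-X's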
Step 4 calls for running mpirun --host node1,node2 hello_c. I have locally
compiled the code and confirmed that it works on each machine individually,
and the same code is shared between the machines. However, it does not run
across both hosts at once: it simply hangs until Ctrl-C'd. I have attached the
--mca plm_base_verbose 10 logs; while I don't see anything obviously wrong in
them, I am not well versed enough in Open MPI to be confident I understand
their full implications.
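The failing step, in brief (the compile line is the usual mpicc wrapper; the
run line, with the full binary path from the logs below shortened, is the one
that hangs):

  mpicc hello_c.c -o hello_c            # the binary runs fine on each node alone
  mpirun --host node1,node2 ./hello_c   # hangs until Ctrl-C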
Notes:
- No firewall is present between the machines (the base is a minimal install,
  so ufw and iptables are not present by default and have not been installed).
- journalctl does not report any errors.
- The machines have identical hardware and were set up with the same
  configuration script.
- mpirun --host node1,node2 mpirun --version returns identical results.
- mpirun --host node1,node2 env | grep -i path returns identical results.
OS: Ubuntu 24.04 LTS
OMPI: 4.1.7rc1 from NVIDIA HPC-X
Configure options:
--prefix=${HPCX_HOME}/ompi \
--with-hcoll=${HPCX_HOME}/hcoll \
--with-ucx=${HPCX_HOME}/ucx \
--with-platform=contrib/platform/mellanox/optimized \
--with-tm=/opt/pbs/ \
--with-slurm=no \
--with-pmix \
--with-hwloc=internal
I'm rather at a loss as to what to try or check next. Any thoughts on how to
continue troubleshooting this issue?
Warm regards,
Collin Strassburger (he/him)
mpirun --mca plm_base_verbose 10 --host hades1,hades2 hostname
[hades1:06033] mca: base: components_register: registering framework plm
components
[hades1:06033] mca: base: components_register: found loaded component tm
[hades1:06033] mca: base: components_register: component tm register function
successful
[hades1:06033] mca: base: components_register: found loaded component isolated
[hades1:06033] mca: base: components_register: component isolated has no
register or open function
[hades1:06033] mca: base: components_register: found loaded component rsh
[hades1:06033] mca: base: components_register: component rsh register function
successful
[hades1:06033] mca: base: components_open: opening plm components
[hades1:06033] mca: base: components_open: found loaded component tm
[hades1:06033] mca: base: components_open: component tm open function successful
[hades1:06033] mca: base: components_open: found loaded component isolated
[hades1:06033] mca: base: components_open: component isolated open function
successful
[hades1:06033] mca: base: components_open: found loaded component rsh
[hades1:06033] mca: base: components_open: component rsh open function
successful
[hades1:06033] mca:base:select: Auto-selecting plm components
[hades1:06033] mca:base:select:( plm) Querying component [tm]
[hades1:06033] mca:base:select:( plm) Querying component [isolated]
[hades1:06033] mca:base:select:( plm) Query of component [isolated] set
priority to 0
[hades1:06033] mca:base:select:( plm) Querying component [rsh]
[hades1:06033] mca:base:select:( plm) Query of component [rsh] set priority to
10
[hades1:06033] mca:base:select:( plm) Selected component [rsh]
[hades1:06033] mca: base: close: component tm closed
[hades1:06033] mca: base: close: unloading component tm
[hades1:06033] mca: base: close: component isolated closed
[hades1:06033] mca: base: close: unloading component isolated
[hades1:06033] [[36677,0],0] plm:rsh: final template argv:
/usr/bin/ssh <template> OPAL_PREFIX=/opt/hpcx/ompi ; export
OPAL_PREFIX; PATH=/opt/hpcx/ompi/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/opt/hpcx/ompi/lib:${LD_LIBRARY_PATH:-} ; export
LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/opt/hpcx/ompi/lib:${DYLD_LIBRARY_PATH:-} ;
export DYLD_LIBRARY_PATH ; /opt/hpcx/ompi/bin/orted -mca ess "env" -mca
ess_base_jobid "2403663872" -mca ess_base_vpid "<template>" -mca
ess_base_num_procs "2" -mca orte_node_regex "hades[1:1-2]@0(2)" -mca
orte_hnp_uri "2403663872.0;tcp://192.168.1.5:38793" --mca plm_base_verbose "10"
-mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri
"2403663872.0;tcp://192.168.1.5:38793" -mca pmix "^s1,s2,cray,isolated"
[hades2:524817] mca: base: components_register: registering framework plm
components
[hades2:524817] mca: base: components_register: found loaded component rsh
[hades2:524817] mca: base: components_register: component rsh register function
successful
[hades2:524817] mca: base: components_open: opening plm components
[hades2:524817] mca: base: components_open: found loaded component rsh
[hades2:524817] mca: base: components_open: component rsh open function
successful
[hades2:524817] mca:base:select: Auto-selecting plm components
[hades2:524817] mca:base:select:( plm) Querying component [rsh]
[hades2:524817] mca:base:select:( plm) Query of component [rsh] set priority
to 10
[hades2:524817] mca:base:select:( plm) Selected component [rsh]
[hades1:06033] [[36677,0],0] complete_setup on job [36677,1]
[hades1:06033] [[36677,0],0] plm:base:receive update proc state command from
[[36677,0],1]
[hades1:06033] [[36677,0],0] plm:base:receive got update_proc_state for job
[36677,1]
hades2
[hades1:06033] [[36677,0],0] plm:base:receive update proc state command from
[[36677,0],1]
[hades1:06033] [[36677,0],0] plm:base:receive got update_proc_state for job
[36677,1]
hades1
[hades2:524817] mca: base: close: component rsh closed
[hades2:524817] mca: base: close: unloading component rsh
[hades1:06033] mca: base: close: component rsh closed
[hades1:06033] mca: base: close: unloading component rsh

mpirun --mca plm_base_verbose 10 --host hades1,hades2 /mnt/cfddata/TestCases/Benchmarks/ompi/examples/hello_c
[hades1:06043] mca: base: components_register: registering framework plm
components
[hades1:06043] mca: base: components_register: found loaded component tm
[hades1:06043] mca: base: components_register: component tm register function
successful
[hades1:06043] mca: base: components_register: found loaded component isolated
[hades1:06043] mca: base: components_register: component isolated has no
register or open function
[hades1:06043] mca: base: components_register: found loaded component rsh
[hades1:06043] mca: base: components_register: component rsh register function
successful
[hades1:06043] mca: base: components_open: opening plm components
[hades1:06043] mca: base: components_open: found loaded component tm
[hades1:06043] mca: base: components_open: component tm open function successful
[hades1:06043] mca: base: components_open: found loaded component isolated
[hades1:06043] mca: base: components_open: component isolated open function
successful
[hades1:06043] mca: base: components_open: found loaded component rsh
[hades1:06043] mca: base: components_open: component rsh open function
successful
[hades1:06043] mca:base:select: Auto-selecting plm components
[hades1:06043] mca:base:select:( plm) Querying component [tm]
[hades1:06043] mca:base:select:( plm) Querying component [isolated]
[hades1:06043] mca:base:select:( plm) Query of component [isolated] set
priority to 0
[hades1:06043] mca:base:select:( plm) Querying component [rsh]
[hades1:06043] mca:base:select:( plm) Query of component [rsh] set priority to
10
[hades1:06043] mca:base:select:( plm) Selected component [rsh]
[hades1:06043] mca: base: close: component tm closed
[hades1:06043] mca: base: close: unloading component tm
[hades1:06043] mca: base: close: component isolated closed
[hades1:06043] mca: base: close: unloading component isolated
[hades1:06043] [[36687,0],0] plm:rsh: final template argv:
/usr/bin/ssh <template> OPAL_PREFIX=/opt/hpcx/ompi ; export
OPAL_PREFIX; PATH=/opt/hpcx/ompi/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/opt/hpcx/ompi/lib:${LD_LIBRARY_PATH:-} ; export
LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/opt/hpcx/ompi/lib:${DYLD_LIBRARY_PATH:-} ;
export DYLD_LIBRARY_PATH ; /opt/hpcx/ompi/bin/orted -mca ess "env" -mca
ess_base_jobid "2404319232" -mca ess_base_vpid "<template>" -mca
ess_base_num_procs "2" -mca orte_node_regex "hades[1:1-2]@0(2)" -mca
orte_hnp_uri "2404319232.0;tcp://192.168.1.5:45739" --mca plm_base_verbose "10"
-mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri
"2404319232.0;tcp://192.168.1.5:45739" -mca pmix "^s1,s2,cray,isolated"
[hades2:524878] mca: base: components_register: registering framework plm
components
[hades2:524878] mca: base: components_register: found loaded component rsh
[hades2:524878] mca: base: components_register: component rsh register function
successful
[hades2:524878] mca: base: components_open: opening plm components
[hades2:524878] mca: base: components_open: found loaded component rsh
[hades2:524878] mca: base: components_open: component rsh open function
successful
[hades2:524878] mca:base:select: Auto-selecting plm components
[hades2:524878] mca:base:select:( plm) Querying component [rsh]
[hades2:524878] mca:base:select:( plm) Query of component [rsh] set priority
to 10
[hades2:524878] mca:base:select:( plm) Selected component [rsh]
[hades1:06043] [[36687,0],0] complete_setup on job [36687,1]
[hades1:06043] [[36687,0],0] plm:base:receive update proc state command from
[[36687,0],1]
[hades1:06043] [[36687,0],0] plm:base:receive got update_proc_state for job
[36687,1]