I'm trying to execute this command:

*mpirun -np 8 --host
openmpi@10.10.1.1,openmpi@10.10.1.2,openmpi@10.10.1.3,openmpi@10.10.1.4
--mca oob_tcp_if_exclude lo,wlp2s0 ompi_info*

Everything works fine if I run the same command with only 2 nodes
(regardless of which two nodes I pick).

With 3 or more nodes I obtain:
*ssh: connect to host 10 port 22: Invalid argument*
followed by the error "ORTE was unable to reliably start one or more daemons."

Running again with --mca plm_base_verbose 100, I found:

...
[Neptune:22627] [[53718,0],0] plm:base:setup_vm add new daemon [[53718,0],1]
[Neptune:22627] [[53718,0],0] plm:base:setup_vm assigning new daemon
[[53718,0],1] to node openmpi@10.10.1.1
[Neptune:22627] [[53718,0],0] plm:base:setup_vm add new daemon [[53718,0],2]
[Neptune:22627] [[53718,0],0] plm:base:setup_vm assigning new daemon
[[53718,0],2] to node openmpi@10.10.1.2
[Neptune:22627] [[53718,0],0] plm:base:setup_vm add new daemon [[53718,0],3]
[Neptune:22627] [[53718,0],0] plm:base:setup_vm assigning new daemon
[[53718,0],3] to node openmpi@10.10.1.3
[Neptune:22627] [[53718,0],0] plm:base:setup_vm add new daemon [[53718,0],4]
[Neptune:22627] [[53718,0],0] plm:base:setup_vm assigning new daemon
[[53718,0],4] to node openmpi@10.10.1.4
...
[Neptune:22627] [[53718,0],0] plm:rsh:launch daemon 0 not a child of mine
[Neptune:22627] [[53718,0],0] plm:rsh: adding node openmpi@10.10.1.1 to
launch list
[Neptune:22627] [[53718,0],0] plm:rsh: adding node openmpi@10.10.1.2 to
launch list
[Neptune:22627] [[53718,0],0] plm:rsh:launch daemon 3 not a child of mine
[Neptune:22627] [[53718,0],0] plm:rsh: adding node openmpi@10.10.1.4 to
launch list
...
[roaster-vm1:00593] [[53718,0],1] plm:rsh: remote spawn called
[roaster-vm1:00593] [[53718,0],1] plm:rsh: local shell: 0 (bash)
[roaster-vm1:00593] [[53718,0],1] plm:rsh: assuming same remote shell as
local shell
[roaster-vm1:00593] [[53718,0],1] plm:rsh: remote shell: 0 (bash)
[roaster-vm1:00593] [[53718,0],1] plm:rsh: final template argv:
/usr/bin/ssh <template>  orted --hnp-topo-sig
0N:1S:0L3:1L2:2L1:2C:2H:x86_64 -mca ess "env" -mca orte_ess_jobid
"3520462848" -mca orte_ess_vpid "<template>" -mca orte_ess_num_procs "5"
-mca orte_parent_uri "3520462848.1;tcp://10.10.1.1:35489" -mca orte_hnp_uri
"3520462848.0;tcp://10.10.10.2:43771" --mca oob_tcp_if_exclude "lo,wlp2s0"
--mca plm_base_verbose "100" -mca plm "rsh" --tree-spawn
[roaster-vm1:00593] [[53718,0],1] plm:rsh: activating launch event
[roaster-vm1:00593] [[53718,0],1] plm:rsh: recording launch of daemon
[[53718,0],3]
[roaster-vm1:00593] [[53718,0],1] plm:rsh: executing: (/usr/bin/ssh)
[*/usr/bin/ssh
openmpi@10  orted* --hnp-topo-sig 0N:1S:0L3:1L2:2L1:2C:2H:x86_64 -mca ess
"env" -mca orte_ess_jobid "3520462848" -mca orte_ess_vpid 3 -mca
orte_ess_num_procs "5" -mca orte_parent_uri "3520462848.1;tcp://
10.10.1.1:35489" -mca orte_hnp_uri "3520462848.0;tcp://10.10.10.2:43771"
--mca oob_tcp_if_exclude "lo,wlp2s0" --mca plm_base_verbose "100" -mca plm
"rsh" --tree-spawn]
*ssh: connect to host 10 port 22: Invalid argument*

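It seems the IP address gets corrupted during the remote spawn: daemon 3 was
assigned to node openmpi@10.10.1.3, but the ssh command launched from
roaster-vm1 targets openmpi@10. Any ideas?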
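
To illustrate the pattern I think I am seeing, here is a small sketch. It is only a guess at the mechanism and is not code taken from Open MPI: the bad ssh target is exactly what you would get if the node name were cut at the first dot, as one might do to shorten a fully qualified hostname. The helper name strip_at_first_dot is hypothetical.

```python
# Hypothetical illustration only: NOT Open MPI code.
# Assumption: somewhere the node name is shortened at the first '.',
# which is harmless for a hostname but destroys a dotted IP address.

def strip_at_first_dot(node: str) -> str:
    """Return everything before the first '.' in the node name."""
    return node.split(".", 1)[0]

# Node names exactly as passed to --host in the command above.
nodes = [
    "openmpi@10.10.1.1",
    "openmpi@10.10.1.2",
    "openmpi@10.10.1.3",
    "openmpi@10.10.1.4",
]

for node in nodes:
    # Every address collapses to the same string, "openmpi@10".
    print(node, "->", strip_at_first_dot(node))
```

Under that assumption every node collapses to openmpi@10, which matches the target in the failing ssh command.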

(I'm using Open MPI version 1.10.0rc7.)


Cheers,
Federico

__
Federico Reghenzani
M.Eng. Student @ Politecnico di Milano
Computer Science and Engineering
