I'm trying to execute this command:
mpirun -np 8 \
  --host openmpi@10.10.1.1,openmpi@10.10.1.2,openmpi@10.10.1.3,openmpi@10.10.1.4 \
  --mca oob_tcp_if_exclude lo,wlp2s0 ompi_info
Everything goes fine if I execute the same command with only 2 nodes
(regardless of which nodes I pick).
With 3 or more nodes I get:
*ssh: connect to host 10 port 22: Invalid argument*
followed by the error "ORTE was unable to reliably start one or more
daemons."
Running with --mca plm_base_verbose 100, I found:
...
[Neptune:22627] [[53718,0],0] plm:base:setup_vm add new daemon
[[53718,0],1]
[Neptune:22627] [[53718,0],0] plm:base:setup_vm assigning new daemon
[[53718,0],1] to node openmpi@10.10.1.1
[Neptune:22627] [[53718,0],0] plm:base:setup_vm add new daemon
[[53718,0],2]
[Neptune:22627] [[53718,0],0] plm:base:setup_vm assigning new daemon
[[53718,0],2] to node openmpi@10.10.1.2
[Neptune:22627] [[53718,0],0] plm:base:setup_vm add new daemon
[[53718,0],3]
[Neptune:22627] [[53718,0],0] plm:base:setup_vm assigning new daemon
[[53718,0],3] to node openmpi@10.10.1.3
[Neptune:22627] [[53718,0],0] plm:base:setup_vm add new daemon
[[53718,0],4]
[Neptune:22627] [[53718,0],0] plm:base:setup_vm assigning new daemon
[[53718,0],4] to node openmpi@10.10.1.4
...
[Neptune:22627] [[53718,0],0] plm:rsh:launch daemon 0 not a child of mine
[Neptune:22627] [[53718,0],0] plm:rsh: adding node openmpi@10.10.1.1
to launch list
[Neptune:22627] [[53718,0],0] plm:rsh: adding node openmpi@10.10.1.2
to launch list
[Neptune:22627] [[53718,0],0] plm:rsh:launch daemon 3 not a child of mine
[Neptune:22627] [[53718,0],0] plm:rsh: adding node openmpi@10.10.1.4
to launch list
...
[roaster-vm1:00593] [[53718,0],1] plm:rsh: remote spawn called
[roaster-vm1:00593] [[53718,0],1] plm:rsh: local shell: 0 (bash)
[roaster-vm1:00593] [[53718,0],1] plm:rsh: assuming same remote shell
as local shell
[roaster-vm1:00593] [[53718,0],1] plm:rsh: remote shell: 0 (bash)
[roaster-vm1:00593] [[53718,0],1] plm:rsh: final template argv:
/usr/bin/ssh <template> orted --hnp-topo-sig
0N:1S:0L3:1L2:2L1:2C:2H:x86_64 -mca ess "env" -mca orte_ess_jobid
"3520462848" -mca orte_ess_vpid "<template>" -mca orte_ess_num_procs
"5" -mca orte_parent_uri "3520462848.1;tcp://10.10.1.1:35489"
-mca orte_hnp_uri "3520462848.0;tcp://10.10.10.2:43771" --mca
oob_tcp_if_exclude "lo,wlp2s0" --mca plm_base_verbose "100" -mca plm
"rsh" --tree-spawn
[roaster-vm1:00593] [[53718,0],1] plm:rsh: activating launch event
[roaster-vm1:00593] [[53718,0],1] plm:rsh: recording launch of daemon
[[53718,0],3]
[roaster-vm1:00593] [[53718,0],1] plm:rsh: executing: (/usr/bin/ssh)
[*/usr/bin/ssh openmpi@10 orted* --hnp-topo-sig
0N:1S:0L3:1L2:2L1:2C:2H:x86_64 -mca ess "env" -mca orte_ess_jobid
"3520462848" -mca orte_ess_vpid 3 -mca orte_ess_num_procs "5" -mca
orte_parent_uri "3520462848.1;tcp://10.10.1.1:35489" -mca orte_hnp_uri
"3520462848.0;tcp://10.10.10.2:43771" --mca
oob_tcp_if_exclude "lo,wlp2s0" --mca plm_base_verbose "100" -mca plm
"rsh" --tree-spawn]
*ssh: connect to host 10 port 22: Invalid argument*
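It seems the IP address gets truncated at the first dot during the
remote spawn: daemon 1 tries to launch daemon 3 on openmpi@10 instead
of openmpi@10.10.1.3. Any ideas?
(I'm using version 1.10.0rc7.)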
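To make the suspicion concrete, here is a minimal Python sketch of my guess. It is an assumption on my part and not taken from the Open MPI source: if a node name were shortened at the first '.', as one might do to turn a fully qualified hostname into a short one, an IP-style entry would collapse to exactly the host that appears in the ssh error. The helper name strip_at_first_dot and the example.org hostname are hypothetical.

```python
def strip_at_first_dot(node: str) -> str:
    """Hypothetical helper: keep only the part of a node name before the first '.'."""
    return node.split(".", 1)[0]


# One IP-style entry from my --host list, and one made-up fully qualified name.
for node in ["openmpi@10.10.1.3", "roaster-vm1.example.org"]:
    print(node, "->", strip_at_first_dot(node))
```

For a real hostname this kind of shortening is harmless, but for openmpi@10.10.1.3 it leaves openmpi@10, which matches what ssh is being asked to connect to in the log above.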
Cheers,
Federico
__
Federico Reghenzani
M.Eng. Student @ Politecnico di Milano
Computer Science and Engineering
_______________________________________________
users mailing list
us...@open-mpi.org
Subscription:http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this
post:http://www.open-mpi.org/community/lists/users/2015/11/28042.php