Federico,

Thanks for the report; I will push a fix shortly.

Meanwhile, as a workaround, you can add
--mca orte_keep_fqdn_hostnames true
to your mpirun command line when using --host user@ip.
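
For instance, applied to the command from the report quoted below, the workaround might look like the following sketch. The host list and the excluded interfaces are copied from that report; the shell variables `hosts` and `cmd` are only illustrative names, and the script prints the command instead of running it so it can be reviewed first.

```shell
# Host list and excluded interfaces are taken verbatim from the original report.
hosts="openmpi@10.10.1.1,openmpi@10.10.1.2,openmpi@10.10.1.3,openmpi@10.10.1.4"

# Same command as in the report, plus the suggested workaround option.
cmd="mpirun -np 8 --host $hosts --mca oob_tcp_if_exclude lo,wlp2s0 --mca orte_keep_fqdn_hostnames true ompi_info"

# Print the command for review; paste it into the shell to actually launch.
echo "$cmd"
```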

Cheers,

Gilles

On 11/17/2015 7:19 PM, Federico Reghenzani wrote:
I'm trying to execute this command:

mpirun -np 8 --host openmpi@10.10.1.1,openmpi@10.10.1.2,openmpi@10.10.1.3,openmpi@10.10.1.4 --mca oob_tcp_if_exclude lo,wlp2s0 ompi_info

Everything goes fine if I execute the same command with only 2 nodes (regardless of which nodes).

With 3 or more nodes I obtain:
*ssh: connect to host 10 port 22: Invalid argument*
followed by the error "ORTE was unable to reliably start one or more daemons."

Searching with plm_base_verbose, I found:

...
[Neptune:22627] [[53718,0],0] plm:base:setup_vm add new daemon [[53718,0],1]
[Neptune:22627] [[53718,0],0] plm:base:setup_vm assigning new daemon [[53718,0],1] to node openmpi@10.10.1.1
[Neptune:22627] [[53718,0],0] plm:base:setup_vm add new daemon [[53718,0],2]
[Neptune:22627] [[53718,0],0] plm:base:setup_vm assigning new daemon [[53718,0],2] to node openmpi@10.10.1.2
[Neptune:22627] [[53718,0],0] plm:base:setup_vm add new daemon [[53718,0],3]
[Neptune:22627] [[53718,0],0] plm:base:setup_vm assigning new daemon [[53718,0],3] to node openmpi@10.10.1.3
[Neptune:22627] [[53718,0],0] plm:base:setup_vm add new daemon [[53718,0],4]
[Neptune:22627] [[53718,0],0] plm:base:setup_vm assigning new daemon [[53718,0],4] to node openmpi@10.10.1.4
...
[Neptune:22627] [[53718,0],0] plm:rsh:launch daemon 0 not a child of mine
[Neptune:22627] [[53718,0],0] plm:rsh: adding node openmpi@10.10.1.1 to launch list
[Neptune:22627] [[53718,0],0] plm:rsh: adding node openmpi@10.10.1.2 to launch list
[Neptune:22627] [[53718,0],0] plm:rsh:launch daemon 3 not a child of mine
[Neptune:22627] [[53718,0],0] plm:rsh: adding node openmpi@10.10.1.4 to launch list
...
[roaster-vm1:00593] [[53718,0],1] plm:rsh: remote spawn called
[roaster-vm1:00593] [[53718,0],1] plm:rsh: local shell: 0 (bash)
[roaster-vm1:00593] [[53718,0],1] plm:rsh: assuming same remote shell as local shell
[roaster-vm1:00593] [[53718,0],1] plm:rsh: remote shell: 0 (bash)
[roaster-vm1:00593] [[53718,0],1] plm:rsh: final template argv:
/usr/bin/ssh <template> orted --hnp-topo-sig 0N:1S:0L3:1L2:2L1:2C:2H:x86_64 -mca ess "env" -mca orte_ess_jobid "3520462848" -mca orte_ess_vpid "<template>" -mca orte_ess_num_procs "5" -mca orte_parent_uri "3520462848.1;tcp://10.10.1.1:35489" -mca orte_hnp_uri "3520462848.0;tcp://10.10.10.2:43771" --mca oob_tcp_if_exclude "lo,wlp2s0" --mca plm_base_verbose "100" -mca plm "rsh" --tree-spawn
[roaster-vm1:00593] [[53718,0],1] plm:rsh: activating launch event
[roaster-vm1:00593] [[53718,0],1] plm:rsh: recording launch of daemon [[53718,0],3]
[roaster-vm1:00593] [[53718,0],1] plm:rsh: executing: (/usr/bin/ssh) [*/usr/bin/ssh openmpi@10 orted* --hnp-topo-sig 0N:1S:0L3:1L2:2L1:2C:2H:x86_64 -mca ess "env" -mca orte_ess_jobid "3520462848" -mca orte_ess_vpid 3 -mca orte_ess_num_procs "5" -mca orte_parent_uri "3520462848.1;tcp://10.10.1.1:35489" -mca orte_hnp_uri "3520462848.0;tcp://10.10.10.2:43771" --mca oob_tcp_if_exclude "lo,wlp2s0" --mca plm_base_verbose "100" -mca plm "rsh" --tree-spawn]
*ssh: connect to host 10 port 22: Invalid argument*

It seems the IP address gets corrupted during remote spawn. Any ideas?

(I'm using version 1.10.0rc7.)
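
To make the symptom concrete: the truncated target in the log matches what you get by cutting the node string at its first dot, the way one would shorten a fully qualified host name. The sketch below only reproduces the observed string in plain shell; it is an assumption about the cause, not the actual ORTE code, and the variable names `node`, `short` and `target` are hypothetical.

```shell
# Node string as passed with --host for daemon 3 (from the log above).
node="openmpi@10.10.1.3"

# Assumed failure mode: everything from the first dot onward is dropped,
# as if the string were a fully qualified host name being shortened.
short="${node%%.*}"

# The part after the user@ prefix is what ssh then treats as the host.
target="${short#*@}"

echo "ssh argument: $short"
echo "host ssh tries to reach: $target"
```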


Cheers,
Federico

__
Federico Reghenzani
M.Eng. Student @ Politecnico di Milano
Computer Science and Engineering




_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/11/28042.php
