I can't check it this week due to the Supercomputing project. It looks like
you are feeding us a host list in which each entry is a user ID plus a
hostname expressed as an IP address. Can you convert the IP address to a
name? I think that might be a workaround until I can address it.
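
A minimal sketch of that workaround in Python, assuming reverse lookup (DNS or /etc/hosts) is configured for the cluster addresses. The helper names `to_named_host` and `host_argument` are made up for illustration and are not part of Open MPI; the resolver is a parameter so the logic can be tried without a network.

```python
import socket


def to_named_host(entry, resolve=socket.gethostbyaddr):
    """Turn 'user@10.10.1.1' into 'user@<hostname>'.

    The entry is returned with its original address if the lookup fails.
    """
    # Split off an optional 'user@' prefix; sep is '' when there is none.
    user, sep, host = entry.rpartition("@")
    try:
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist).
        name = resolve(host)[0]
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        name = host
    return f"{user}{sep}{name}"


def host_argument(entries, resolve=socket.gethostbyaddr):
    """Build the comma-separated value to pass to mpirun --host."""
    return ",".join(to_named_host(e, resolve) for e in entries)
```

Calling `host_argument(["openmpi@10.10.1.1", "openmpi@10.10.1.2"])` would give a string to paste after `--host`. The resulting names have to resolve on every node, not just on the one running mpirun.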


On Tue, Nov 17, 2015 at 4:19 AM, Federico Reghenzani <
federico1.reghenz...@mail.polimi.it> wrote:

> I'm trying to execute this command:
>
>
> *mpirun -np 8 --host
> openmpi@10.10.1.1,openmpi@10.10.1.2,openmpi@10.10.1.3,openmpi@10.10.1.4
> --mca oob_tcp_if_exclude lo,wlp2s0 ompi_info*
>
> Everything goes fine if I execute the same command with only 2 nodes
> (independently of which nodes).
>
> With 3 or more nodes I obtain:
> *ssh: connect to host 10 port 22: Invalid argument*
> followed by "ORTE was unable to reliably start one or more daemons." error.
>
> Searching with plm_base_verbose, I found:
>
> ...
> [Neptune:22627] [[53718,0],0] plm:base:setup_vm add new daemon
> [[53718,0],1]
> [Neptune:22627] [[53718,0],0] plm:base:setup_vm assigning new daemon
> [[53718,0],1] to node openmpi@10.10.1.1
> [Neptune:22627] [[53718,0],0] plm:base:setup_vm add new daemon
> [[53718,0],2]
> [Neptune:22627] [[53718,0],0] plm:base:setup_vm assigning new daemon
> [[53718,0],2] to node openmpi@10.10.1.2
> [Neptune:22627] [[53718,0],0] plm:base:setup_vm add new daemon
> [[53718,0],3]
> [Neptune:22627] [[53718,0],0] plm:base:setup_vm assigning new daemon
> [[53718,0],3] to node openmpi@10.10.1.3
> [Neptune:22627] [[53718,0],0] plm:base:setup_vm add new daemon
> [[53718,0],4]
> [Neptune:22627] [[53718,0],0] plm:base:setup_vm assigning new daemon
> [[53718,0],4] to node openmpi@10.10.1.4
> ...
> [Neptune:22627] [[53718,0],0] plm:rsh:launch daemon 0 not a child of mine
> [Neptune:22627] [[53718,0],0] plm:rsh: adding node openmpi@10.10.1.1 to
> launch list
> [Neptune:22627] [[53718,0],0] plm:rsh: adding node openmpi@10.10.1.2 to
> launch list
> [Neptune:22627] [[53718,0],0] plm:rsh:launch daemon 3 not a child of mine
> [Neptune:22627] [[53718,0],0] plm:rsh: adding node openmpi@10.10.1.4 to
> launch list
> ...
> [roaster-vm1:00593] [[53718,0],1] plm:rsh: remote spawn called
> [roaster-vm1:00593] [[53718,0],1] plm:rsh: local shell: 0 (bash)
> [roaster-vm1:00593] [[53718,0],1] plm:rsh: assuming same remote shell as
> local shell
> [roaster-vm1:00593] [[53718,0],1] plm:rsh: remote shell: 0 (bash)
> [roaster-vm1:00593] [[53718,0],1] plm:rsh: final template argv:
> /usr/bin/ssh <template>  orted --hnp-topo-sig
> 0N:1S:0L3:1L2:2L1:2C:2H:x86_64 -mca ess "env" -mca orte_ess_jobid
> "3520462848" -mca orte_ess_vpid "<template>" -mca orte_ess_num_procs "5"
> -mca orte_parent_uri "3520462848.1;tcp://10.10.1.1:35489" -mca
> orte_hnp_uri "3520462848.0;tcp://10.10.10.2:43771" --mca
> oob_tcp_if_exclude "lo,wlp2s0" --mca plm_base_verbose "100" -mca plm "rsh"
> --tree-spawn
> [roaster-vm1:00593] [[53718,0],1] plm:rsh: activating launch event
> [roaster-vm1:00593] [[53718,0],1] plm:rsh: recording launch of daemon
> [[53718,0],3]
> [roaster-vm1:00593] [[53718,0],1] plm:rsh: executing: (/usr/bin/ssh)
> [*/usr/bin/ssh openmpi@10  orted* --hnp-topo-sig
> 0N:1S:0L3:1L2:2L1:2C:2H:x86_64 -mca ess "env" -mca orte_ess_jobid
> "3520462848" -mca orte_ess_vpid 3 -mca orte_ess_num_procs "5" -mca
> orte_parent_uri "3520462848.1;tcp://10.10.1.1:35489" -mca orte_hnp_uri
> "3520462848.0;tcp://10.10.10.2:43771" --mca oob_tcp_if_exclude "lo,wlp2s0"
> --mca plm_base_verbose "100" -mca plm "rsh" --tree-spawn]
> *ssh: connect to host 10 port 22: Invalid argument*
>
> It seems the IP address gets corrupted during remote spawn. Any ideas?
>
> (I'm using version 1.10.0rc7.)
>
>
> Cheers,
> Federico
>
> __
> Federico Reghenzani
> M.Eng. Student @ Politecnico di Milano
> Computer Science and Engineering
>
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/11/28042.php
>
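
The log shows the target `openmpi@10.10.1.3` arriving at ssh as `openmpi@10`, which is what one would get by cutting the name at its first '.', as is commonly done to shorten a fully qualified domain name. That reading is a guess from the log alone and not a confirmed description of the Open MPI source. The Python sketch below, with hypothetical function names, reproduces the symptom and shows a guarded variant that leaves IP addresses alone.

```python
import ipaddress


def is_ip(host):
    """Return True if host parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def naive_short_name(entry):
    # Cut at the first '.', as if everything after it were a DNS domain.
    return entry.split(".", 1)[0]


def guarded_short_name(entry):
    # Separate an optional 'user@' prefix before looking at the host part.
    user, sep, host = entry.rpartition("@")
    if is_ip(host):
        # An IP address has no domain suffix to strip; keep it intact.
        return entry
    return f"{user}{sep}{host.split('.', 1)[0]}"
```

Here `naive_short_name("openmpi@10.10.1.3")` gives `openmpi@10`, matching the failing ssh call, while `guarded_short_name` returns the entry unchanged. This would also fit the suggestion to use names: a short name without dots has nothing to cut.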
