On Apr 13, 2014, at 11:42 AM, Allan Wu <al...@cs.ucla.edu> wrote:

> Thanks, Ralph!
> 
> Adding the MCA parameter 'plm_rsh_no_tree_spawn' solves the problem.
> 
> If I understand correctly, the first layer of daemons covers three nodes, and 
> when there are more than three nodes a second layer of daemons is spawned. 
> So my problem happens when MPI processes are launched by the second layer 
> of daemons, is that correct?

Yes, that is correct.

> I think that is very likely; the second layer of daemons may be missing some 
> environment settings. 
> It would be really helpful if I could solve the problem, though. Are there any 
> documents I can find on how the daemons work? Do you have any suggestions 
> on how I can debug the issue?

The easiest way to debug the issue is to add "-mca plm_base_verbose 5 
--debug-daemons" to your command line. This will show the commands being used 
in the launch and allow ssh errors to reach the screen.
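
For example, folded into the mpirun invocation from your earlier message (same 
prefix, hostfile, and binary; adjust as needed for your setup), that would look 
something like:

# mpirun --prefix /mnt/embedded_root/openmpi -mca plm_base_verbose 5 --debug-daemons \
      -np 4 --map-by node -hostfile hostfile ./helloworld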


> 
> Thanks,
> Allan 
> 
> On Sat, Apr 12, 2014 at 9:00 AM, <users-requ...@open-mpi.org> wrote:
> 
> The problem is with the tree-spawn nature of the rsh/ssh launcher. For 
> scalability, mpirun only launches a first "layer" of daemons. Each of those 
> daemons then launches another layer in a tree-like fanout. The default 
> pattern is such that you first notice it when you have four nodes in your 
> allocation.
> 
> You have two choices:
> 
> * you can just add the MCA param plm_rsh_no_tree_spawn=1 to your 
> environment/cmd line (see the sketch after this list)
> 
> * you can resolve the tree spawn issue so that a daemon on one of your nodes 
> is capable of ssh-ing a daemon on another node
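> 
> For the first choice, a minimal sketch (reusing the hostfile and binary from 
> your original command; adjust to your setup) is either of:
> 
> # mpirun --mca plm_rsh_no_tree_spawn 1 -np 4 --map-by node -hostfile hostfile ./helloworld
> 
> or, in the environment, using Open MPI's OMPI_MCA_ prefix for MCA parameters:
> 
> # export OMPI_MCA_plm_rsh_no_tree_spawn=1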
> 
> Either way will work.
> Ralph
> 
> 
> On Apr 11, 2014, at 11:17 AM, Allan Wu <al...@cs.ucla.edu> wrote:
> 
> > Hello everyone,
> >
> > I am running a simple helloworld program on several nodes using OpenMPI 
> > 1.8. Running the command on a single node or a small number of nodes is 
> > successful, but when I tried to run the same binary on four different 
> > nodes, problems occurred.
> >
> > I am using an 'mpirun' command line like the following:
> > # mpirun --prefix /mnt/embedded_root/openmpi -np 4 --map-by node -hostfile 
> > hostfile ./helloworld
> > And my hostfile looks something like this:
> > 10.0.0.16
> > 10.0.0.17
> > 10.0.0.18
> > 10.0.0.19
> >
> > Executing this command results in the error message "sh: syntax 
> > error: unexpected word", and the program deadlocks. When I added 
> > "--debug-devel", the output was as shown in the attachment "err_msg_0.txt". 
> > In the log, "fpga0" is the hostname of "10.0.0.16", "fpga1" is "10.0.0.17", 
> > and so on.
> >
> > However, the weird part is that after I remove one line from the hostfile, 
> > the problem goes away. It does not matter which host I remove; as long as 
> > there are fewer than four hosts, the program executes without any problem.
> >
> > I also tried using hostnames in the hostfile:
> > fpga0
> > fpga1
> > fpga2
> > fpga3
> > The same problem occurs, but the error message becomes "Host key 
> > verification failed." I have set up public/private key pairs on all nodes, 
> > and each node can ssh to any other node without problems. I also attached 
> > the --debug-devel output as "err_msg_1.txt".
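> >
> > (For reference, a quick way to exercise the node-to-node hop that the tree 
> > spawn relies on, using the hostnames above purely as a sketch, is to chain 
> > ssh through one compute node to another and confirm it completes without 
> > any prompt:)
> >
> > # ssh fpga0 ssh fpga1 hostname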
> >
> > I'm running MPI programs on embedded ARM processors. I have previously 
> > posted questions about cross-compilation to the devel mailing list, and 
> > those posts describe the setup I used. If you need that information, please 
> > refer to http://www.open-mpi.org/community/lists/devel/2014/04/14440.php; 
> > the output of 'ompi_info --all' is also attached to this email.
> >
> > Please let me know if I need to provide more information. Thanks in advance!
> >
> > Regards,
> > --
> > Di Wu (Allan)
> > PhD student, VAST Laboratory,
> > Department of Computer Science, UC Los Angeles
> > Email: al...@cs.ucla.edu
> > <err_msg_0.txt> <err_msg_1.txt> <log.tar.gz>
