On Apr 13, 2014, at 11:42 AM, Allan Wu <al...@cs.ucla.edu> wrote:

> Thanks, Ralph!
>
> Adding MCA parameter 'plm_rsh_no_tree_spawn' solves the problem.
>
> If I understand correctly, the first layer of daemons is three nodes, and
> when there are more than three nodes the second layer of daemons is spawned.
> So my problem happens when MPI processes are launched by the second layer
> of daemons, is that correct?

Yes, that is correct.

> I think that is very likely, the second layer of daemons may be missing some
> environmental settings.
> It would be really helpful if I can solve the problem though, are there any
> documents I can find on the way the daemons work? Do you have any suggestions
> on the way I can debug the issue?

The easiest way to debug the issue is to add "-mca plm_base_verbose 5
--debug-daemons" to your command line. This will show the commands being used
in the launch, and allow ssh errors to reach the screen.

>
> Thanks,
> Allan
>
> On Sat, Apr 12, 2014 at 9:00 AM, <users-requ...@open-mpi.org> wrote:
>
> The problem is with the tree-spawn nature of the rsh/ssh launcher. For
> scalability, mpirun only launches a first "layer" of daemons. Each of those
> daemons then launches another layer in a tree-like fanout. The default
> pattern is such that you first notice it when you have four nodes in your
> allocation.
>
> You have two choices:
>
> * you can just add the MCA param plm_rsh_no_tree_spawn=1 to your
> environment/cmd line
>
> * you can resolve the tree spawn issue so that a daemon on one of your nodes
> is capable of ssh-ing a daemon on another node
>
> Either way will work.
> Ralph
>
>
> On Apr 11, 2014, at 11:17 AM, Allan Wu <al...@cs.ucla.edu> wrote:
>
> > Hello everyone,
> >
> > I am running a simple helloworld program on several nodes using OpenMPI
> > 1.8. Running commands on a single node or a small number of nodes is
> > successful, but when I tried to run the same binary on four different
> > nodes, problems occurred.
> >
> > I am using an 'mpirun' command line like the following:
> > # mpirun --prefix /mnt/embedded_root/openmpi -np 4 --map-by node -hostfile
> > hostfile ./helloworld
> > And my hostfile looks something like this:
> > 10.0.0.16
> > 10.0.0.17
> > 10.0.0.18
> > 10.0.0.19
> >
> > When executing this command, it will result in an error message "sh: syntax
> > error: unexpected word", and the program will deadlock.
> > When I added "--debug-devel" the output is in the attachment
> > "err_msg_0.txt". In the log, "fpga0" is the hostname of "10.0.0.16" and
> > "fpga1" is for "10.0.0.17" and so on.
> >
> > However, the weird part is that after I remove one line in the hostfile,
> > the problem goes away. It does not matter which host I remove, as long as
> > there are fewer than four hosts, the program can execute without any problem.
> >
> > I also tried using hostnames in the hostfile, as:
> > fpga0
> > fpga1
> > fpga2
> > fpga3
> > And the same problem occurs, and the error message becomes "Host key
> > verification failed.". I have set up public/private key pairs on all nodes,
> > and each node can ssh to any node without problems. I also attached the
> > message of --debug-devel as "err_msg_1.txt".
> >
> > I'm running MPI programs on embedded ARM processors. I have previously
> > posted questions on cross-compilation on the develop mailing list, which
> > contains the setup I used. If you need the information please refer to
> > http://www.open-mpi.org/community/lists/devel/2014/04/14440.php, and the
> > output of 'ompi-info --all' is also attached with this email.
> >
> > Please let me know if I need to provide more information. Thanks in advance!
> >
> > Regards,
> > --
> > Di Wu (Allan)
> > PhD student, VAST Laboratory,
> > Department of Computer Science, UC Los Angeles
> > Email: al...@cs.ucla.edu
> > <err_msg_0.txt><err_msg_1.txt><log.tar.gz>
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
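
For reference, the suggestions in this thread can be collected into a small dry-run script. This is only a sketch: it prints commands rather than running them, the function names (launch_cmd, no_tree_spawn_cmd, debug_cmd, ssh_pair_cmds) are made up for illustration, and the prefix, hostfile and binary are the ones from the original post. The ssh check is an assumption about how one might test the second choice (a daemon on one node reaching another node over ssh without prompting); it is not taken from the Open MPI documentation.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands discussed in this thread, runs nothing.
# Values below are copied from the original post.
HOSTFILE=hostfile
APP=./helloworld
PREFIX_DIR=/mnt/embedded_root/openmpi

# The original launch line, with extra options inserted after "mpirun".
launch_cmd() {
    printf '%s\n' "mpirun $* --prefix $PREFIX_DIR -np 4 --map-by node -hostfile $HOSTFILE $APP"
}

# Workaround: disable tree spawn so mpirun starts every daemon itself.
no_tree_spawn_cmd() {
    launch_cmd --mca plm_rsh_no_tree_spawn 1
}

# Diagnosis: show the launch commands and let ssh errors reach the screen.
debug_cmd() {
    launch_cmd -mca plm_base_verbose 5 --debug-daemons
}

# One non-interactive ssh hop per ordered pair of hosts in the file "$1",
# imitating a daemon on one node starting a daemon on another.
# BatchMode=yes makes ssh fail instead of prompting for a password or for
# confirmation of an unknown host key.
ssh_pair_cmds() {
    for src in $(cat "$1"); do
        for dst in $(cat "$1"); do
            [ "$src" = "$dst" ] || printf '%s\n' "ssh $src ssh -o BatchMode=yes $dst true"
        done
    done
}

no_tree_spawn_cmd
debug_cmd
if [ -f "$HOSTFILE" ]; then ssh_pair_cmds "$HOSTFILE"; fi
```

The same MCA parameter can also be set through the environment as OMPI_MCA_plm_rsh_no_tree_spawn=1 instead of on the command line. If any of the printed ssh hops fails when run by hand, that pair of nodes is a likely candidate for the tree-spawn failure.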