The problem comes from the tree-spawn behavior of the rsh/ssh launcher. For scalability, mpirun itself launches only a first "layer" of daemons. Each of those daemons then launches another layer in a tree-like fanout. With the default pattern, that second layer first comes into play when you have four nodes in your allocation, which matches what you are seeing.
You have two choices:

* you can just add the MCA param plm_rsh_no_tree_spawn=1 to your environment/cmd line

* you can resolve the tree spawn issue so that a daemon on one of your nodes is capable of ssh-ing a daemon on another node

Either way will work.
Ralph

On Apr 11, 2014, at 11:17 AM, Allan Wu <al...@cs.ucla.edu> wrote:

> Hello everyone,
>
> I am running a simple helloworld program on several nodes using OpenMPI 1.8.
> Running commands on single node or small number of nodes are successful, but
> when I tried to run the same binary on four different nodes, problems
> occurred.
>
> I am using 'mpirun' command line like the following:
> # mpirun --prefix /mnt/embedded_root/openmpi -np 4 --map-by node -hostfile hostfile ./helloworld
> And my hostfile looks something like these:
> 10.0.0.16
> 10.0.0.17
> 10.0.0.18
> 10.0.0.19
>
> When executing this command, it will result in an error message "sh: syntax
> error: unexpected word", and the program will deadlock. When I added
> "--debug-devel" the output is in the attachment "err_msg_0.txt". In the log,
> "fpga0" is the hostname of "10.0.0.16" and "fpga1" is for "10.0.0.17" and so
> on.
>
> However, the weird part is that after I remove one line in the hostfile, the
> problem goes away. It does not matter which host I remove, as long as there
> is less than four hosts, the program can execute without any problem.
>
> I also tried using hostname in the hostfile, as:
> fpga0
> fpga1
> fpga2
> fpga3
> And the same problem occurs, and the error message becomes "Host key
> verification failed.". I have setup public/private key pairs on all nodes,
> and each node can ssh to any node without problems. I also attached the
> message of --debug-devel as "err_msg_1.txt".
>
> I'm running MPI programs on embedded ARM processors. I have previously posted
> questions on cross-compilation on the develop mailing list, which contains
> the setup I used. If you need the information please refer to
> http://www.open-mpi.org/community/lists/devel/2014/04/14440.php, and the
> output of 'ompi-info --all' is also attached with this email.
>
> Please let me know if I need to provide more information. Thanks in advance!
>
> Regards,
> --
> Di Wu (Allan)
> PhD student, VAST Laboratory,
> Department of Computer Science, UC Los Angeles
> Email: al...@cs.ucla.edu
> <err_msg_0.txt><err_msg_1.txt><log.tar.gz>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
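
P.S. To make the two choices at the top of this reply concrete, here is a rough shell sketch, not a definitive recipe. The first part sets the MCA parameter through the environment, using the OMPI_MCA_<name> variable form, and shows the equivalent --mca command-line form as a comment with the mpirun arguments from the quoted post. The second part is an assumption on my side about how one might verify the tree-spawn path: print_ssh_checks is a hypothetical helper (not part of Open MPI) that only prints one nested ssh command per ordered pair of hosts, using the hostnames from the quoted hostfile. Running the printed commands by hand should show which node-to-node hop fails, for example with "Host key verification failed." if one node does not yet have another node's host key recorded under that name.

```shell
# Option 1: disable tree spawn so that mpirun launches every daemon itself.
# Open MPI picks up MCA parameters from OMPI_MCA_<name> environment variables.
export OMPI_MCA_plm_rsh_no_tree_spawn=1

# Equivalent command-line form (arguments taken from the quoted post):
#   mpirun --mca plm_rsh_no_tree_spawn 1 --prefix /mnt/embedded_root/openmpi \
#       -np 4 --map-by node -hostfile hostfile ./helloworld

# Option 2: check that each node can ssh to each other node without a prompt.
# print_ssh_checks is a hypothetical helper for illustration only. It does not
# contact any host; it just prints the commands to run from the head node.
print_ssh_checks() {
    hosts="$*"
    for src in $hosts; do
        for dst in $hosts; do
            # Skip the trivial case of a node connecting to itself.
            [ "$src" = "$dst" ] && continue
            # BatchMode=yes makes ssh fail instead of asking for a password
            # or for confirmation of an unknown host key.
            echo "ssh -o BatchMode=yes $src ssh -o BatchMode=yes $dst true"
        done
    done
}

# Hostnames as listed in the quoted hostfile.
print_ssh_checks fpga0 fpga1 fpga2 fpga3
```

Each printed line logs in to the first host and, from there, tries to reach the second one, which is the same hop a daemon has to make during tree spawn. Any line that fails or hangs when you run it points at the pair of nodes to fix; if you go with option 1 instead, these checks are not needed for the launch to work.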