Hello everyone, I am running a simple helloworld program on several nodes using OpenMPI 1.8. Running it on a single node or a small number of nodes succeeds, but when I tried to run the same binary on four different nodes, problems occurred.
I am using an 'mpirun' command line like the following:

    # mpirun --prefix /mnt/embedded_root/openmpi -np 4 --map-by node -hostfile hostfile ./helloworld

And my hostfile looks like this:

    10.0.0.16
    10.0.0.17
    10.0.0.18
    10.0.0.19

Executing this command results in the error message "sh: syntax error: unexpected word", and the program deadlocks. The output with "--debug-devel" added is in the attachment "err_msg_0.txt". In the log, "fpga0" is the hostname of "10.0.0.16", "fpga1" is that of "10.0.0.17", and so on.

However, the weird part is that after I remove one line from the hostfile, the problem goes away. It does not matter which host I remove; as long as there are fewer than four hosts, the program executes without any problem.

I also tried using hostnames in the hostfile:

    fpga0
    fpga1
    fpga2
    fpga3

The same problem occurs, but the error message becomes "Host key verification failed.". I have set up public/private key pairs on all nodes, and each node can ssh to any other node without problems. I have also attached the --debug-devel output for this case as "err_msg_1.txt".

I'm running MPI programs on embedded ARM processors. I previously posted questions about cross-compilation on the devel mailing list, which contains the setup I used. If you need that information, please refer to http://www.open-mpi.org/community/lists/devel/2014/04/14440.php. The output of 'ompi_info --all' is also attached to this email.

Please let me know if I need to provide more information. Thanks in advance!

Regards,
--
Di Wu (Allan)
PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
Department of Computer Science, UC Los Angeles
Email: al...@cs.ucla.edu
[fpga0:02057] sess_dir_finalize: job session dir not empty - leaving
[fpga0:02057] procdir: /tmp/openmpi-sessions-root@fpga0_0/16688/0/0
[fpga0:02057] jobdir: /tmp/openmpi-sessions-root@fpga0_0/16688/0
[fpga0:02057] top: openmpi-sessions-root@fpga0_0
[fpga0:02057] tmp: /tmp
[fpga0:02057] mpirun: reset PATH: /mnt/embedded_root/openmpi/bin:/mnt/embedded_root/openmpi/bin:/sbin:/usr/sbin:/bin:/usr/bin
[fpga0:02057] mpirun: reset LD_LIBRARY_PATH: /mnt/embedded_root/openmpi/lib:/mnt/embedded_root/openmpi/lib:/mnt/embedded_root/lib
[fpga2:01321] sess_dir_finalize:i job session dir not empty - leaving
[fpga1:01882] sess_dir_finalize: job session dir not empty - leaving
[fpga2:01321] procdir: /tmp/openmpi-sessions-root@fpga2_0/16688/0/2
[fpga2:01321] jobdir: /tmp/openmpi-sessions-root@fpga2_0/16688/0
[fpga2:01321] top: openmpi-sessions-root@fpga2_0
[fpga2:01321] tmp: /tmp
[fpga1:01882] procdir: /tmp/openmpi-sessions-root@fpga1_0/16688/0/1
[fpga1:01882] jobdir: /tmp/openmpi-sessions-root@fpga1_0/16688/0
[fpga1:01882] top: openmpi-sessions-root@fpga1_0
[fpga1:01882] tmp: /tmp
sh: syntax error: unexpected word
[fpga2:01321] sess_dir_finalize: job session dir not empty - leaving
exiting with status 0
[fpga1:01882] sess_dir_finalize: job session dir not empty - leaving
exiting with status 1
^Cmpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
[fpga0:02057] sess_dir_finalize: job session dir not empty - leaving
[fpga0:02104] sess_dir_finalize: job session dir not empty - leaving
[fpga0:02104] procdir: /tmp/openmpi-sessions-root@fpga0_0/16641/0/0
[fpga0:02104] jobdir: /tmp/openmpi-sessions-root@fpga0_0/16641/0
[fpga0:02104] top: openmpi-sessions-root@fpga0_0
[fpga0:02104] tmp: /tmp
[fpga0:02104] mpirun: reset PATH: /mnt/embedded_root/openmpi/bin:/mnt/embedded_root/openmpi/bin:/sbin:/usr/sbin:/bin:/usr/bin
[fpga0:02104] mpirun: reset LD_LIBRARY_PATH: /mnt/embedded_root/openmpi/lib:/mnt/embedded_root/openmpi/lib:/mnt/embedded_root/lib
[fpga2:01361] sess_dir_finalize: job session dir not empty - leaving
[fpga1:01926] sess_dir_finalize: job session dir not empty - leaving
[fpga1:01926] procdir: /tmp/openmpi-sessions-root@fpga1_0/16641/0/1
[fpga1:01926] jobdir: /tmp/openmpi-sessions-root@fpga1_0/16641/0
[fpga1:01926] top: openmpi-sessions-root@fpga1_0
[fpga1:01926] tmp: /tmp
[fpga2:01361] procdir: /tmp/openmpi-sessions-root@fpga2_0/16641/0/2
[fpga2:01361] jobdir: /tmp/openmpi-sessions-root@fpga2_0/16641/0
[fpga2:01361] top: openmpi-sessions-root@fpga2_0
[fpga2:01361] tmp: /tmp
Host key verification failed.
[fpga2:01361] sess_dir_finalize: job session dir not empty - leaving
[fpga1:01926] sess_dir_finalize: job session dir not empty - leaving
exiting with status 0
exiting with status 1
^Cmpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
[fpga0:02104] sess_dir_finalize: job session dir not empty - leaving
log.tar.gz
Description: GNU Zip compressed data