Thank you for your help! The issue is definitely the firewall. Since I don't plan on having any communication between the "slave" nodes of my cluster (SPMD with no cross-talk), and the cluster is fairly small, I'll stick with option 2 for now.
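For reference, here is a rough sketch of both of Ralph's suggestions applied to the setup from the original report. The firewall commands assume the stock Fedora iptables service, and the placement of the "-mca routed direct" flag is an assumption rather than something verified in this thread:

# Option 1: check whether a firewall rule on a worker node is blocking TCP from its peers
[username@cluster1 ~]$ sudo iptables -L -n
# Quick test only: temporarily stop the firewall, re-run the job, then re-enable it and
# open the needed TCP ports between the cluster nodes instead
[username@cluster1 ~]$ sudo service iptables stop

# Option 2: have mpirun talk directly to each daemon
[username@cluster ~]$ /usr/lib/openmpi/bin/mpirun -mca routed direct -np 4 --hostfile /home/username/.mpi_hostfile hostname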
On Mon, Mar 28, 2011 at 3:43 PM, Ralph Castain <r...@open-mpi.org> wrote:
> It is hanging because your last nodes are not receiving the launch command.
>
> The daemons receive a message from mpirun telling them what to launch. That message is sent via a tree-like routing algorithm: mpirun sends to the first two daemons, each of which relays it on to some number of daemons, each of which relays it to another number, etc.
>
> What is happening here is that the first pair of daemons are not relaying the message on to the next layer. You can try a couple of things:
>
> 1. Ensure that it is possible for a daemon on one node to open a TCP socket to any other node - i.e., that a daemon on cluster1 (for example) can open a socket to cluster5 and send a message across. You might have a firewall in the way, or some other prohibition blocking this connection.
>
> 2. Given the small size of the cluster, add "-mca routed direct" to your command line. This will tell mpirun to talk directly to each daemon. However, note that your job may still fail, as the procs won't be able to open sockets to their peers to send MPI messages if you use TCP for the MPI transport.
>
> Ralph
>
> On Mar 28, 2011, at 1:24 PM, Igor wrote:
>
>> Hello,
>>
>> First off, complete MPI newbie here. I have installed openmpi-1.4.3-1.fc13.i686 on an IBM blade cluster running Fedora. I can open as many slots as I want on remote machines, as long as I only connect to two machines (it doesn't matter which two).
>>
>> For example, I run my MPI task from "cluster", and my hostfile is:
>>
>> cluster slots=1 max-slots=1
>> cluster3 slots=1
>> cluster5 slots=1
>> cluster1 slots=1
>>
>> If I now run:
>> [username@cluster ~]$ /usr/lib/openmpi/bin/mpirun -np 3 --hostfile /home/username/.mpi_hostfile hostname
>>
>> the output is:
>> cluster.mydomain.ca
>> cluster3.mydomain.ca
>> cluster5.mydomain.ca
>>
>> If I run:
>> [username@cluster ~]$ /usr/lib/openmpi/bin/mpirun -np 4 --hostfile /home/username/.mpi_hostfile hostname
>>
>> I expect to see:
>> cluster.mydomain.ca
>> cluster3.mydomain.ca
>> cluster5.mydomain.ca
>> cluster1.mydomain.ca
>>
>> Instead, I see the same output as when running 3 processes (-np 3), and the task hangs.
>>
>> Below is the output when I run mpirun with the --debug-daemons flag. The same behaviour is seen: the process hangs when "-np 4" is requested:
>>
>> ################################
>> [username@cluster ~]$ /usr/lib/openmpi/bin/mpirun --debug-daemons -np 3 --hostfile /home/username/.mpi_hostfile hostname
>> Daemon was launched on cluster3.mydomain.ca - beginning to initialize
>> Daemon was launched on cluster5.mydomain.ca - beginning to initialize
>> Daemon [[12927,0],1] checking in as pid 3096 on host cluster3.mydomain.ca
>> Daemon [[12927,0],1] not using static ports
>> [cluster3.mydomain.ca:03096] [[12927,0],1] orted: up and running - waiting for commands!
>> Daemon [[12927,0],2] checking in as pid 11301 on host cluster5.mydomain.ca
>> Daemon [[12927,0],2] not using static ports
>> [cluster.mydomain.ca:12279] [[12927,0],0] node[0].name cluster daemon 0 arch ffca0200
>> [cluster5.mydomain.ca:11301] [[12927,0],2] orted: up and running - waiting for commands!
>> [cluster.mydomain.ca:12279] [[12927,0],0] node[1].name cluster3 daemon 1 arch ffca0200
>> [cluster.mydomain.ca:12279] [[12927,0],0] node[2].name cluster5 daemon 2 arch ffca0200
>> [cluster.mydomain.ca:12279] [[12927,0],0] node[3].name cluster1 daemon INVALID arch ffca0200
>> [cluster.mydomain.ca:12279] [[12927,0],0] orted_cmd: received add_local_procs
>> [cluster3.mydomain.ca:03096] [[12927,0],1] node[0].name cluster daemon 0 arch ffca0200
>> [cluster3.mydomain.ca:03096] [[12927,0],1] node[1].name cluster3 daemon 1 arch ffca0200
>> [cluster3.mydomain.ca:03096] [[12927,0],1] node[2].name cluster5 daemon 2 arch ffca0200
>> [cluster3.mydomain.ca:03096] [[12927,0],1] node[3].name cluster1 daemon INVALID arch ffca0200
>> [cluster5.mydomain.ca:11301] [[12927,0],2] node[0].name cluster daemon 0 arch ffca0200
>> [cluster5.mydomain.ca:11301] [[12927,0],2] node[1].name cluster3 daemon 1 arch ffca0200
>> [cluster5.mydomain.ca:11301] [[12927,0],2] node[2].name cluster5 daemon 2 arch ffca0200
>> [cluster5.mydomain.ca:11301] [[12927,0],2] node[3].name cluster1 daemon INVALID arch ffca0200
>> [cluster3.mydomain.ca:03096] [[12927,0],1] orted_cmd: received add_local_procs
>> [cluster5.mydomain.ca:11301] [[12927,0],2] orted_cmd: received add_local_procs
>> cluster.mydomain.ca
>> [cluster.mydomain.ca:12279] [[12927,0],0] orted_cmd: received waitpid_fired cmd
>> [cluster.mydomain.ca:12279] [[12927,0],0] orted_cmd: received iof_complete cmd
>> cluster3.mydomain.ca
>> cluster5.mydomain.ca
>> [cluster5.mydomain.ca:11301] [[12927,0],2] orted_cmd: received waitpid_fired cmd
>> [cluster3.mydomain.ca:03096] [[12927,0],1] orted_cmd: received waitpid_fired cmd
>> [cluster3.mydomain.ca:03096] [[12927,0],1] orted_cmd: received iof_complete cmd
>> [cluster5.mydomain.ca:11301] [[12927,0],2] orted_cmd: received iof_complete cmd
>> [cluster.mydomain.ca:12279] [[12927,0],0] orted_cmd: received exit
>> [cluster5.mydomain.ca:11301] [[12927,0],2] orted_cmd: received exit
>> [cluster3.mydomain.ca:03096] [[12927,0],1] orted_cmd: received exit
>> [cluster3.mydomain.ca:03096] [[12927,0],1] orted: finalizing
>> [cluster5.mydomain.ca:11301] [[12927,0],2] orted: finalizing
>>
>> ################################
>> [username@cluster ~]$ /usr/lib/openmpi/bin/mpirun --debug-daemons -np 4 --hostfile /home/username/.mpi_hostfile hostname
>> Daemon was launched on cluster5.mydomain.ca - beginning to initialize
>> Daemon was launched on cluster3.mydomain.ca - beginning to initialize
>> Daemon [[12919,0],2] checking in as pid 11325 on host cluster5.mydomain.ca
>> Daemon [[12919,0],2] not using static ports
>> [cluster5.mydomain.ca:11325] [[12919,0],2] orted: up and running - waiting for commands!
>> Daemon was launched on cluster1.mydomain.ca - beginning to initialize
>> Daemon [[12919,0],1] checking in as pid 3120 on host cluster3.mydomain.ca
>> Daemon [[12919,0],1] not using static ports
>> [cluster3.mydomain.ca:03120] [[12919,0],1] orted: up and running - waiting for commands!
>> Daemon [[12919,0],3] checking in as pid 5623 on host cluster1.mydomain.ca
>> Daemon [[12919,0],3] not using static ports
>> [cluster1.mydomain.ca:05623] [[12919,0],3] orted: up and running - waiting for commands!
>> [cluster.mydomain.ca:12287] [[12919,0],0] node[0].name cluster daemon 0 arch ffca0200
>> [cluster.mydomain.ca:12287] [[12919,0],0] node[1].name cluster3 daemon 1 arch ffca0200
>> [cluster.mydomain.ca:12287] [[12919,0],0] node[2].name cluster5 daemon 2 arch ffca0200
>> [cluster.mydomain.ca:12287] [[12919,0],0] node[3].name cluster1 daemon 3 arch ffca0200
>> [cluster.mydomain.ca:12287] [[12919,0],0] orted_cmd: received add_local_procs
>> [cluster5.mydomain.ca:11325] [[12919,0],2] node[0].name cluster daemon 0 arch ffca0200
>> [cluster5.mydomain.ca:11325] [[12919,0],2] node[1].name cluster3 daemon 1 arch ffca0200
>> [cluster5.mydomain.ca:11325] [[12919,0],2] node[2].name cluster5 daemon 2 arch ffca0200
>> [cluster5.mydomain.ca:11325] [[12919,0],2] node[3].name cluster1 daemon 3 arch ffca0200
>> [cluster3.mydomain.ca:03120] [[12919,0],1] node[0].name cluster daemon 0 arch ffca0200
>> [cluster3.mydomain.ca:03120] [[12919,0],1] node[1].name cluster3 daemon 1 arch ffca0200
>> [cluster3.mydomain.ca:03120] [[12919,0],1] node[2].name cluster5 daemon 2 arch ffca0200
>> [cluster3.mydomain.ca:03120] [[12919,0],1] node[3].name cluster1 daemon 3 arch ffca0200
>> [cluster3.mydomain.ca:03120] [[12919,0],1] orted_cmd: received add_local_procs
>> [cluster5.mydomain.ca:11325] [[12919,0],2] orted_cmd: received add_local_procs
>> cluster.mydomain.ca
>> [cluster.mydomain.ca:12287] [[12919,0],0] orted_cmd: received waitpid_fired cmd
>> [cluster.mydomain.ca:12287] [[12919,0],0] orted_cmd: received iof_complete cmd
>> cluster3.mydomain.ca
>> cluster5.mydomain.ca
>> [cluster3.mydomain.ca:03120] [[12919,0],1] orted_cmd: received waitpid_fired cmd
>> [cluster3.mydomain.ca:03120] [[12919,0],1] orted_cmd: received iof_complete cmd
>> [cluster5.mydomain.ca:11325] [[12919,0],2] orted_cmd: received waitpid_fired cmd
>> [cluster5.mydomain.ca:11325] [[12919,0],2] orted_cmd: received iof_complete cmd
>> <<<<<<<<<<<<THE PROCESS HANGS HERE>>>>>>>>>>>>
>> ^CKilled by signal 2.
>> Killed by signal 2.
>> Killed by signal 2.
>> --------------------------------------------------------------------------
>> A daemon (pid 12288) died unexpectedly with status 255 while attempting
>> to launch so we are aborting.
>>
>> There may be more information reported by the environment (see above).
>>
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>> mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
>>
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> [cluster.mydomain.ca:12287] [[12919,0],0] orted_cmd: received exit
>> mpirun: clean termination accomplished
>>
>> ################################
>>
>> Notes:
>> 1. Passwordless ssh login between all cluster# machines works fine.
>> 2. It doesn't matter which two machines I specify in .mpi_hostfile. I can always connect to 1 or 2 of them, and get the freeze when I try 3 or more.
>> 3. I installed Open MPI using the yum installer of Fedora.
>> By default, it chose /usr/lib/openmpi/ as the install directory, instead of the /opt/openmpi-... that is mentioned throughout the Open MPI FAQ. I can't imagine that being a problem...
>> 4. Supplying PATH and LD_LIBRARY_PATH: the Open MPI FAQ says "specifying the absolute pathname to mpirun is equivalent to using the --prefix argument", so that's what I chose, after reading all the scaremongering about modifying LD_LIBRARY_PATH :) Adding "/usr/lib/openmpi/lib" to the otherwise empty LD_LIBRARY_PATH produces the same results.
>>
>> Can someone suggest a possible solution, or at least a direction in which I should continue my troubleshooting?
>>
>> --
>>
>> Thank you all for your time,
>> Igor
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>

--
Regards,
Igor
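For completeness, here is a sketch of the FAQ equivalence mentioned in note 4 above, using the paths from this thread; the --prefix form is a restatement of that FAQ passage and did not appear in the original messages:

[username@cluster ~]$ /usr/lib/openmpi/bin/mpirun -np 4 --hostfile /home/username/.mpi_hostfile hostname
[username@cluster ~]$ mpirun --prefix /usr/lib/openmpi -np 4 --hostfile /home/username/.mpi_hostfile hostname   # assuming mpirun is on the local PATH

Both forms should point the remote orted daemons at the Open MPI installation under /usr/lib/openmpi on each node.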