Thank you for your help! The issue is definitely the firewall. Since I
don't plan on having any communication between the "slave" nodes of my
cluster (SPMD with no cross-talk), and the cluster is fairly small,
I'll stick with option 2 for now.
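
For the record, if I have understood option 2 correctly, that just means
adding the MCA flag to the command line I was already using, roughly:

[username@cluster ~]$ /usr/lib/openmpi/bin/mpirun -mca routed direct \
    -np 4 --hostfile /home/username/.mpi_hostfile hostname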

On Mon, Mar 28, 2011 at 3:43 PM, Ralph Castain <r...@open-mpi.org> wrote:
> It is hanging because your last nodes are not receiving the launch command.
>
> The daemons receive a message from mpirun telling them what to launch. That
> message is sent via a tree-like routing algorithm: mpirun sends to the
> first two daemons, each of which relays it on to some number of daemons,
> which in turn relay it further, and so on.
>
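> Incidentally, if you want to see which routing component is in use and
> what the alternatives are, ompi_info should list the "routed" framework
> and its parameters. If my memory of the 1.4 series is right, something
> like this will do it (path taken from where your mpirun lives):
>
> [username@cluster ~]$ /usr/lib/openmpi/bin/ompi_info --param routed all
>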
> What is happening here is that the first pair of daemons are not relaying the 
> message on to the next layer. You can try a couple of things:
>
> 1. ensure that it is possible for a daemon on one node to open a TCP socket
> to any other node - i.e., that a daemon on cluster1 (for example) can open a
> socket to cluster5 and send a message across. You might have a firewall in
> the way, or some other restriction blocking the connection. (A quick way to
> check this from the shell is sketched just after these two options.)
>
> 2. given the small size of the cluster, add "-mca routed direct" to your
> command line. This will tell mpirun to talk directly to each daemon. Note,
> however, that if you use TCP for the MPI transport your job may still fail,
> because the procs won't be able to open sockets to their peers to send MPI
> messages.
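>
> For option 1, a rough reachability check from the shell (assuming nc/netcat
> is installed on the nodes; the port number here is arbitrary, since the
> daemons use ephemeral ports) would be something like:
>
> # on cluster1 (use "nc -l -p 12345" if your netcat flavour needs -p):
> [username@cluster1 ~]$ nc -l 12345
> # then, from cluster5:
> [username@cluster5 ~]$ echo hello | nc cluster1 12345
>
> If "hello" never shows up on cluster1, inspect the firewall rules on both
> nodes, e.g. with "iptables -L -n" as root.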
>
> Ralph
>
> On Mar 28, 2011, at 1:24 PM, Igor wrote:
>
>> Hello,
>>
>> First off, complete MPI newbie here. I have installed
>> openmpi-1.4.3-1.fc13.i686 on an IBM blade cluster running Fedora. I
>> can open as many slots as I want on remote machines, as long as I only
>> connect to two of them (it doesn't matter which two).
>>
>> For example, I run my MPI task from "cluster". Suppose my hostfile is:
>>
>> cluster slots=1 max-slots=1
>> cluster3 slots=1
>> cluster5 slots=1
>> cluster1 slots=1
>>
>> If I now run:
>> [username@cluster ~]$ /usr/lib/openmpi/bin/mpirun -np 3 --hostfile
>> /home/username/.mpi_hostfile hostname
>>
>> The output is:
>> cluster.mydomain.ca
>> cluster3.mydomain.ca
>> cluster5.mydomain.ca
>>
>> If I run:
>> [username@cluster ~]$ /usr/lib/openmpi/bin/mpirun -np 4 --hostfile
>> /home/username/.mpi_hostfile hostname
>> I expect to see:
>> cluster.mydomain.ca
>> cluster3.mydomain.ca
>> cluster5.mydomain.ca
>> cluster1.mydomain.ca
>>
>> Instead, I see the same output as when running 3 processes (-np 3),
>> and the task hangs.
>>
>> Below is the output when I run mpirun with the --debug-daemons flag. The
>> same behaviour is seen; the process hangs when "-np 4" is requested:
>>
>> ################################
>> [username@cluster ~]$ /usr/lib/openmpi/bin/mpirun --debug-daemons -np
>> 3 --hostfile /home/username/.mpi_hostfile hostname
>> Daemon was launched on cluster3.mydomain.ca - beginning to initialize
>> Daemon was launched on cluster5.mydomain.ca - beginning to initialize
>> Daemon [[12927,0],1] checking in as pid 3096 on host cluster3.mydomain.ca
>> Daemon [[12927,0],1] not using static ports
>> [cluster3.mydomain.ca:03096] [[12927,0],1] orted: up and running -
>> waiting for commands!
>> Daemon [[12927,0],2] checking in as pid 11301 on host cluster5.mydomain.ca
>> Daemon [[12927,0],2] not using static ports
>> [cluster.mydomain.ca:12279] [[12927,0],0] node[0].name cluster daemon
>> 0 arch ffca0200
>> [cluster5.mydomain.ca:11301] [[12927,0],2] orted: up and running -
>> waiting for commands!
>> [cluster.mydomain.ca:12279] [[12927,0],0] node[1].name cluster3 daemon
>> 1 arch ffca0200
>> [cluster.mydomain.ca:12279] [[12927,0],0] node[2].name cluster5 daemon
>> 2 arch ffca0200
>> [cluster.mydomain.ca:12279] [[12927,0],0] node[3].name cluster1 daemon
>> INVALID arch ffca0200
>> [cluster.mydomain.ca:12279] [[12927,0],0] orted_cmd: received add_local_procs
>> [cluster3.mydomain.ca:03096] [[12927,0],1] node[0].name cluster daemon
>> 0 arch ffca0200
>> [cluster3.mydomain.ca:03096] [[12927,0],1] node[1].name cluster3
>> daemon 1 arch ffca0200
>> [cluster3.mydomain.ca:03096] [[12927,0],1] node[2].name cluster5
>> daemon 2 arch ffca0200
>> [cluster3.mydomain.ca:03096] [[12927,0],1] node[3].name cluster1
>> daemon INVALID arch ffca0200
>> [cluster5.mydomain.ca:11301] [[12927,0],2] node[0].name cluster daemon
>> 0 arch ffca0200
>> [cluster5.mydomain.ca:11301] [[12927,0],2] node[1].name cluster3
>> daemon 1 arch ffca0200
>> [cluster5.mydomain.ca:11301] [[12927,0],2] node[2].name cluster5
>> daemon 2 arch ffca0200
>> [cluster5.mydomain.ca:11301] [[12927,0],2] node[3].name cluster1
>> daemon INVALID arch ffca0200
>> [cluster3.mydomain.ca:03096] [[12927,0],1] orted_cmd: received 
>> add_local_procs
>> [cluster5.mydomain.ca:11301] [[12927,0],2] orted_cmd: received 
>> add_local_procs
>> cluster.mydomain.ca
>> [cluster.mydomain.ca:12279] [[12927,0],0] orted_cmd: received waitpid_fired 
>> cmd
>> [cluster.mydomain.ca:12279] [[12927,0],0] orted_cmd: received iof_complete 
>> cmd
>> cluster3.mydomain.ca
>> cluster5.mydomain.ca
>> [cluster5.mydomain.ca:11301] [[12927,0],2] orted_cmd: received waitpid_fired 
>> cmd
>> [cluster3.mydomain.ca:03096] [[12927,0],1] orted_cmd: received waitpid_fired 
>> cmd
>> [cluster3.mydomain.ca:03096] [[12927,0],1] orted_cmd: received iof_complete 
>> cmd
>> [cluster5.mydomain.ca:11301] [[12927,0],2] orted_cmd: received iof_complete 
>> cmd
>> [cluster.mydomain.ca:12279] [[12927,0],0] orted_cmd: received exit
>> [cluster5.mydomain.ca:11301] [[12927,0],2] orted_cmd: received exit
>> [cluster3.mydomain.ca:03096] [[12927,0],1] orted_cmd: received exit
>> [cluster3.mydomain.ca:03096] [[12927,0],1] orted: finalizing
>> [cluster5.mydomain.ca:11301] [[12927,0],2] orted: finalizing
>>
>> ################################
>> [username@cluster ~]$ /usr/lib/openmpi/bin/mpirun --debug-daemons -np
>> 4 --hostfile /home/username/.mpi_hostfile hostname
>> Daemon was launched on cluster5.mydomain.ca - beginning to initialize
>> Daemon was launched on cluster3.mydomain.ca - beginning to initialize
>> Daemon [[12919,0],2] checking in as pid 11325 on host cluster5.mydomain.ca
>> Daemon [[12919,0],2] not using static ports
>> [cluster5.mydomain.ca:11325] [[12919,0],2] orted: up and running -
>> waiting for commands!
>> Daemon was launched on cluster1.mydomain.ca - beginning to initialize
>> Daemon [[12919,0],1] checking in as pid 3120 on host cluster3.mydomain.ca
>> Daemon [[12919,0],1] not using static ports
>> [cluster3.mydomain.ca:03120] [[12919,0],1] orted: up and running -
>> waiting for commands!
>> Daemon [[12919,0],3] checking in as pid 5623 on host cluster1.mydomain.ca
>> Daemon [[12919,0],3] not using static ports
>> [cluster1.mydomain.ca:05623] [[12919,0],3] orted: up and running -
>> waiting for commands!
>> [cluster.mydomain.ca:12287] [[12919,0],0] node[0].name cluster daemon
>> 0 arch ffca0200
>> [cluster.mydomain.ca:12287] [[12919,0],0] node[1].name cluster3 daemon
>> 1 arch ffca0200
>> [cluster.mydomain.ca:12287] [[12919,0],0] node[2].name cluster5 daemon
>> 2 arch ffca0200
>> [cluster.mydomain.ca:12287] [[12919,0],0] node[3].name cluster1 daemon
>> 3 arch ffca0200
>> [cluster.mydomain.ca:12287] [[12919,0],0] orted_cmd: received add_local_procs
>> [cluster5.mydomain.ca:11325] [[12919,0],2] node[0].name cluster daemon
>> 0 arch ffca0200
>> [cluster5.mydomain.ca:11325] [[12919,0],2] node[1].name cluster3
>> daemon 1 arch ffca0200
>> [cluster5.mydomain.ca:11325] [[12919,0],2] node[2].name cluster5
>> daemon 2 arch ffca0200
>> [cluster5.mydomain.ca:11325] [[12919,0],2] node[3].name cluster1
>> daemon 3 arch ffca0200
>> [cluster3.mydomain.ca:03120] [[12919,0],1] node[0].name cluster daemon
>> 0 arch ffca0200
>> [cluster3.mydomain.ca:03120] [[12919,0],1] node[1].name cluster3
>> daemon 1 arch ffca0200
>> [cluster3.mydomain.ca:03120] [[12919,0],1] node[2].name cluster5
>> daemon 2 arch ffca0200
>> [cluster3.mydomain.ca:03120] [[12919,0],1] node[3].name cluster1
>> daemon 3 arch ffca0200
>> [cluster3.mydomain.ca:03120] [[12919,0],1] orted_cmd: received 
>> add_local_procs
>> [cluster5.mydomain.ca:11325] [[12919,0],2] orted_cmd: received 
>> add_local_procs
>> cluster.mydomain.ca
>> [cluster.mydomain.ca:12287] [[12919,0],0] orted_cmd: received waitpid_fired 
>> cmd
>> [cluster.mydomain.ca:12287] [[12919,0],0] orted_cmd: received iof_complete 
>> cmd
>> cluster3.mydomain.ca
>> cluster5.mydomain.ca
>> [cluster3.mydomain.ca:03120] [[12919,0],1] orted_cmd: received waitpid_fired 
>> cmd
>> [cluster3.mydomain.ca:03120] [[12919,0],1] orted_cmd: received iof_complete 
>> cmd
>> [cluster5.mydomain.ca:11325] [[12919,0],2] orted_cmd: received waitpid_fired 
>> cmd
>> [cluster5.mydomain.ca:11325] [[12919,0],2] orted_cmd: received iof_complete 
>> cmd
>> <<<<<<<<<<<<THE PROCESS HANGS HERE>>>>>>>>>>>>
>> ^CKilled by signal 2.
>> Killed by signal 2.
>> Killed by signal 2.
>> --------------------------------------------------------------------------
>> A daemon (pid 12288) died unexpectedly with status 255 while attempting
>> to launch so we are aborting.
>>
>> There may be more information reported by the environment (see above).
>>
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>> mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
>>
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> [cluster.mydomain.ca:12287] [[12919,0],0] orted_cmd: received exit
>> mpirun: clean termination accomplished
>>
>> ################################
>>
>> Notes:
>> 1. Passwordless ssh login between all cluster# machines works fine.
>> 2. It doesn't matter which machines I specify in .mpi_hostfile: I can
>> always connect to 1 or 2 of them, and I get the freeze when I try 3
>> or more.
>> 3. I installed Open MPI using Fedora's yum installer. By default, it
>> chose /usr/lib/openmpi/ as the install directory, instead of the
>> /opt/openmpi-... that is mentioned throughout the Open MPI FAQ. I
>> can't imagine that being a problem...
>> 4. Supplying PATH and LD_LIBRARY_PATH: the Open MPI FAQ says that
>> "specifying the absolute pathname to mpirun is equivalent to using the
>> --prefix argument", so that's what I chose, after reading all the
>> scaremongering about modifying LD_LIBRARY_PATH :) Adding
>> "/usr/lib/openmpi/lib" to the otherwise empty LD_LIBRARY_PATH produces
>> the same results. (The equivalent forms are sketched just below these
>> notes.)
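>>
>> For reference, the forms I mean are roughly as follows (paths as on my
>> system; the explicit --prefix variant is the one from the FAQ that I
>> have not actually tried myself):
>>
>> # absolute path to mpirun, which the FAQ says implies --prefix
>> [username@cluster ~]$ /usr/lib/openmpi/bin/mpirun -np 4 \
>>     --hostfile /home/username/.mpi_hostfile hostname
>>
>> # the same thing with an explicit --prefix
>> [username@cluster ~]$ /usr/lib/openmpi/bin/mpirun --prefix /usr/lib/openmpi \
>>     -np 4 --hostfile /home/username/.mpi_hostfile hostname
>>
>> # or setting the library path by hand first (bash)
>> [username@cluster ~]$ export LD_LIBRARY_PATH=/usr/lib/openmpi/lib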
>>
>> Can someone suggest a possible solution or at least a direction in
>> which I should continue my troubleshooting?
>>
>> --
>>
>> Thank you all for your time,
>> Igor
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 

Regards,
Igor
