Hello,

First off, I'm a complete MPI newbie. I have installed openmpi-1.4.3-1.fc13.i686 on an IBM blade cluster running Fedora. I can open as many slots as I want on remote machines, as long as I only connect to two machines at a time (it doesn't matter which two).
For example, I run my MPI task from "cluster", and my hostfile is:

cluster slots=1 max-slots=1
cluster3 slots=1
cluster5 slots=1
cluster1 slots=1

If I now run:

[username@cluster ~]$ /usr/lib/openmpi/bin/mpirun -np 3 --hostfile /home/username/.mpi_hostfile hostname

the output is:

cluster.mydomain.ca
cluster3.mydomain.ca
cluster5.mydomain.ca

If I run:

[username@cluster ~]$ /usr/lib/openmpi/bin/mpirun -np 4 --hostfile /home/username/.mpi_hostfile hostname

I expect to see:

cluster.mydomain.ca
cluster3.mydomain.ca
cluster5.mydomain.ca
cluster1.mydomain.ca

Instead, I see the same output as with three processes (-np 3), and the task hangs.

Below is the output when I run mpirun with the --debug-daemons flag. The same behaviour occurs: the process hangs when -np 4 is requested.

################################
[username@cluster ~]$ /usr/lib/openmpi/bin/mpirun --debug-daemons -np 3 --hostfile /home/username/.mpi_hostfile hostname
Daemon was launched on cluster3.mydomain.ca - beginning to initialize
Daemon was launched on cluster5.mydomain.ca - beginning to initialize
Daemon [[12927,0],1] checking in as pid 3096 on host cluster3.mydomain.ca
Daemon [[12927,0],1] not using static ports
[cluster3.mydomain.ca:03096] [[12927,0],1] orted: up and running - waiting for commands!
Daemon [[12927,0],2] checking in as pid 11301 on host cluster5.mydomain.ca
Daemon [[12927,0],2] not using static ports
[cluster.mydomain.ca:12279] [[12927,0],0] node[0].name cluster daemon 0 arch ffca0200
[cluster5.mydomain.ca:11301] [[12927,0],2] orted: up and running - waiting for commands!
[cluster.mydomain.ca:12279] [[12927,0],0] node[1].name cluster3 daemon 1 arch ffca0200
[cluster.mydomain.ca:12279] [[12927,0],0] node[2].name cluster5 daemon 2 arch ffca0200
[cluster.mydomain.ca:12279] [[12927,0],0] node[3].name cluster1 daemon INVALID arch ffca0200
[cluster.mydomain.ca:12279] [[12927,0],0] orted_cmd: received add_local_procs
[cluster3.mydomain.ca:03096] [[12927,0],1] node[0].name cluster daemon 0 arch ffca0200
[cluster3.mydomain.ca:03096] [[12927,0],1] node[1].name cluster3 daemon 1 arch ffca0200
[cluster3.mydomain.ca:03096] [[12927,0],1] node[2].name cluster5 daemon 2 arch ffca0200
[cluster3.mydomain.ca:03096] [[12927,0],1] node[3].name cluster1 daemon INVALID arch ffca0200
[cluster5.mydomain.ca:11301] [[12927,0],2] node[0].name cluster daemon 0 arch ffca0200
[cluster5.mydomain.ca:11301] [[12927,0],2] node[1].name cluster3 daemon 1 arch ffca0200
[cluster5.mydomain.ca:11301] [[12927,0],2] node[2].name cluster5 daemon 2 arch ffca0200
[cluster5.mydomain.ca:11301] [[12927,0],2] node[3].name cluster1 daemon INVALID arch ffca0200
[cluster3.mydomain.ca:03096] [[12927,0],1] orted_cmd: received add_local_procs
[cluster5.mydomain.ca:11301] [[12927,0],2] orted_cmd: received add_local_procs
cluster.mydomain.ca
[cluster.mydomain.ca:12279] [[12927,0],0] orted_cmd: received waitpid_fired cmd
[cluster.mydomain.ca:12279] [[12927,0],0] orted_cmd: received iof_complete cmd
cluster3.mydomain.ca
cluster5.mydomain.ca
[cluster5.mydomain.ca:11301] [[12927,0],2] orted_cmd: received waitpid_fired cmd
[cluster3.mydomain.ca:03096] [[12927,0],1] orted_cmd: received waitpid_fired cmd
[cluster3.mydomain.ca:03096] [[12927,0],1] orted_cmd: received iof_complete cmd
[cluster5.mydomain.ca:11301] [[12927,0],2] orted_cmd: received iof_complete cmd
[cluster.mydomain.ca:12279] [[12927,0],0] orted_cmd: received exit
[cluster5.mydomain.ca:11301] [[12927,0],2] orted_cmd: received exit
[cluster3.mydomain.ca:03096] [[12927,0],1] orted_cmd: received exit
[cluster3.mydomain.ca:03096] [[12927,0],1] orted: finalizing
[cluster5.mydomain.ca:11301] [[12927,0],2] orted: finalizing
################################

################################
[username@cluster ~]$ /usr/lib/openmpi/bin/mpirun --debug-daemons -np 4 --hostfile /home/username/.mpi_hostfile hostname
Daemon was launched on cluster5.mydomain.ca - beginning to initialize
Daemon was launched on cluster3.mydomain.ca - beginning to initialize
Daemon [[12919,0],2] checking in as pid 11325 on host cluster5.mydomain.ca
Daemon [[12919,0],2] not using static ports
[cluster5.mydomain.ca:11325] [[12919,0],2] orted: up and running - waiting for commands!
Daemon was launched on cluster1.mydomain.ca - beginning to initialize
Daemon [[12919,0],1] checking in as pid 3120 on host cluster3.mydomain.ca
Daemon [[12919,0],1] not using static ports
[cluster3.mydomain.ca:03120] [[12919,0],1] orted: up and running - waiting for commands!
Daemon [[12919,0],3] checking in as pid 5623 on host cluster1.mydomain.ca
Daemon [[12919,0],3] not using static ports
[cluster1.mydomain.ca:05623] [[12919,0],3] orted: up and running - waiting for commands!
[cluster.mydomain.ca:12287] [[12919,0],0] node[0].name cluster daemon 0 arch ffca0200
[cluster.mydomain.ca:12287] [[12919,0],0] node[1].name cluster3 daemon 1 arch ffca0200
[cluster.mydomain.ca:12287] [[12919,0],0] node[2].name cluster5 daemon 2 arch ffca0200
[cluster.mydomain.ca:12287] [[12919,0],0] node[3].name cluster1 daemon 3 arch ffca0200
[cluster.mydomain.ca:12287] [[12919,0],0] orted_cmd: received add_local_procs
[cluster5.mydomain.ca:11325] [[12919,0],2] node[0].name cluster daemon 0 arch ffca0200
[cluster5.mydomain.ca:11325] [[12919,0],2] node[1].name cluster3 daemon 1 arch ffca0200
[cluster5.mydomain.ca:11325] [[12919,0],2] node[2].name cluster5 daemon 2 arch ffca0200
[cluster5.mydomain.ca:11325] [[12919,0],2] node[3].name cluster1 daemon 3 arch ffca0200
[cluster3.mydomain.ca:03120] [[12919,0],1] node[0].name cluster daemon 0 arch ffca0200
[cluster3.mydomain.ca:03120] [[12919,0],1] node[1].name cluster3 daemon 1 arch ffca0200
[cluster3.mydomain.ca:03120] [[12919,0],1] node[2].name cluster5 daemon 2 arch ffca0200
[cluster3.mydomain.ca:03120] [[12919,0],1] node[3].name cluster1 daemon 3 arch ffca0200
[cluster3.mydomain.ca:03120] [[12919,0],1] orted_cmd: received add_local_procs
[cluster5.mydomain.ca:11325] [[12919,0],2] orted_cmd: received add_local_procs
cluster.mydomain.ca
[cluster.mydomain.ca:12287] [[12919,0],0] orted_cmd: received waitpid_fired cmd
[cluster.mydomain.ca:12287] [[12919,0],0] orted_cmd: received iof_complete cmd
cluster3.mydomain.ca
cluster5.mydomain.ca
[cluster3.mydomain.ca:03120] [[12919,0],1] orted_cmd: received waitpid_fired cmd
[cluster3.mydomain.ca:03120] [[12919,0],1] orted_cmd: received iof_complete cmd
[cluster5.mydomain.ca:11325] [[12919,0],2] orted_cmd: received waitpid_fired cmd
[cluster5.mydomain.ca:11325] [[12919,0],2] orted_cmd: received iof_complete cmd
<<<<<<<<<<<< THE PROCESS HANGS HERE >>>>>>>>>>>>
^CKilled by signal 2.
Killed by signal 2.
Killed by signal 2.
--------------------------------------------------------------------------
A daemon (pid 12288) died unexpectedly with status 255 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
[cluster.mydomain.ca:12287] [[12919,0],0] orted_cmd: received exit
mpirun: clean termination accomplished
################################

Notes:

1. Passwordless ssh login between all cluster# machines works fine.

2. It doesn't matter which machines I list in .mpi_hostfile: I can always run on any one or two of them, and I get the freeze as soon as I try three or more.

3. I installed Open MPI using Fedora's yum installer. By default it chose /usr/lib/openmpi/ as the install directory, instead of the /opt/openmpi-... mentioned throughout the Open MPI FAQ. I can't imagine that being the problem...

4. Supplying PATH and LD_LIBRARY_PATH: the Open MPI FAQ says "specifying the absolute pathname to mpirun is equivalent to using the --prefix argument", so that's what I chose, after reading all the scaremongering about modifying LD_LIBRARY_PATH :) Adding "/usr/lib/openmpi/lib" to the otherwise empty LD_LIBRARY_PATH produces the same results. (A rough sketch of both approaches is in the P.S. below.)

Can someone suggest a possible solution, or at least a direction in which I should continue troubleshooting?

--
Thank you all for your time,
Igor
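P.S. For completeness, here is roughly what I mean by the checks in notes 1 and 4. The paths are simply where the Fedora package put things on my machines; treat this as a sketch rather than my literal shell history.

    # Note 1: passwordless ssh from the head node to every compute node works,
    # e.g. a loop like this prints each hostname without prompting:
    for h in cluster1 cluster3 cluster5; do ssh $h hostname; done

    # Note 4, first approach: invoke mpirun by its absolute path, which the
    # FAQ says is equivalent to passing --prefix /usr/lib/openmpi explicitly:
    /usr/lib/openmpi/bin/mpirun -np 4 --hostfile /home/username/.mpi_hostfile hostname

    # Note 4, second approach: set the environment by hand instead
    # (LD_LIBRARY_PATH was empty before); this behaves exactly the same:
    export PATH=/usr/lib/openmpi/bin:$PATH
    export LD_LIBRARY_PATH=/usr/lib/openmpi/lib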