This is my first attempt at configuring a Beowulf cluster running MPI. All of the nodes are PS3s running Yellow Dog Linux 6.2, and the host (server) is a Dell i686 quad-core running Fedora Core 12. Thanks to a couple of members on this forum (in a previous question), I learned that I needed to download the Open MPI source, then configure, compile, and install it on each of my machines. I downloaded v1.4.1, configured it as non-heterogeneous, and compiled and installed it individually on each node and on the server.

I have an NFS-shared directory on the host where the application resides after building. All of the nodes have access to the shared volume and can see the files in it. SSH is set up so that the server can log into each node without a password, and vice versa. The built-in firewalls (iptables and ip6tables) are disabled.
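For reference, this is roughly what I did on each machine to build and install Open MPI, along with the kind of hostfile I'm using on the server (the install prefix, tarball name, and hostnames below are placeholders from memory rather than my exact values):

  # same steps, run natively on the Dell and on each PS3
  tar xjf openmpi-1.4.1.tar.bz2
  cd openmpi-1.4.1
  ./configure --prefix=/usr/local/openmpi   # non-heterogeneous, i.e. without --enable-heterogeneous
  make all
  make install

  # hostfile on the server (one line per PS3; hostnames are placeholders)
  ps3-node1 slots=2
  ps3-node2 slots=2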
I downloaded and modified a very simple master/slave framework application in which the slave does a simple computation and gets its processor name. The slave returns both pieces of information to the master, which simply displays them in the terminal window. The master farms out 1024 such tasks to the slaves and, after finalizing, exits. I run the application in one of three ways:

1. "mpirun -np 2 host_application" - launched and run locally on the server, using one of its remaining 3 cores as the slave.
2. "mpirun -np 1 node_application" - launched and run locally on the node, using the second slot as the slave.
3. "mpirun -np 1 --host host_name host_application : -np 1 --host hostfile node_application" - runs host_application as the master on the Dell server and node_application as a slave (rank 1) on the first PS3.

host_application and node_application are built from identical source, but each is compiled on its own machine so that the executable is loadable there.

OK, so methods 1 and 2 run fine: the master farms out 1024 tasks to the slave, and the return values look like I expect. However, when I run method 3, the application hangs - no error messages, nothing. What I have discovered through rudimentary debugging (writing to files) is that the master (Dell) makes its MPI_Init call and node_application is launched on the slave (PS3). The slave recognizes itself as rank 1 and enters the slave code, which waits for the first message from the master. However, that message from the master, an MPI_Send, is never received by the slave. Since MPI_Send on the master is blocking and MPI_Recv on the slave is also blocking, processing simply stalls.

This appears to be some kind of configuration issue between Fedora and YDL, or I have not set something up properly. Please keep in mind that when the applications run locally they perform the same Init, Send, and Recv calls as when farming out to the cluster, just without going off-board, so to speak. Compiling and running the application on the native hardware works perfectly (i.e., compiled and run on the PS3, or compiled and run on the Dell), so I know the code is written properly and executes properly locally.

Has anyone else experienced this kind of behavior? Were you able to solve it? Does anyone have suggestions as to where I might look to resolve this issue?

Thanks,
Lee Manko
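P.S. In case it helps, here is a stripped-down sketch of the master/slave structure I'm describing. This is not my exact code: the tag values, the round-robin handout, and the doubling "computation" are just illustrative stand-ins.

#include <mpi.h>
#include <stdio.h>

#define NUM_TASKS 1024   /* number of tasks the master farms out */
#define TAG_WORK  1      /* tag values are illustrative */
#define TAG_DONE  2

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* master: hand out NUM_TASKS work items, one at a time */
        int task, dest, result;
        char name[MPI_MAX_PROCESSOR_NAME];

        for (task = 0; task < NUM_TASKS; task++) {
            dest = 1 + task % (size - 1);              /* pick a slave */
            MPI_Send(&task, 1, MPI_INT, dest, TAG_WORK, MPI_COMM_WORLD);
            MPI_Recv(&result, 1, MPI_INT, dest, TAG_WORK,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, dest, TAG_WORK,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("task %d -> %d from %s\n", task, result, name);
        }
        /* tell the slaves to shut down */
        for (dest = 1; dest < size; dest++)
            MPI_Send(&task, 1, MPI_INT, dest, TAG_DONE, MPI_COMM_WORLD);
    } else {
        /* slave: wait for work, compute, send the answer and our name back */
        int task, result, namelen;
        char name[MPI_MAX_PROCESSOR_NAME];
        MPI_Status status;

        while (1) {
            /* in method 3, this is the Recv that never completes */
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            if (status.MPI_TAG == TAG_DONE)
                break;
            result = task * 2;                         /* stand-in for the simple computation */
            MPI_Get_processor_name(name, &namelen);
            MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
            MPI_Send(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, 0, TAG_WORK,
                     MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}

In method 3 the slave gets as far as the MPI_Recv at the top of that loop and never returns from it.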