This is my first attempt at configuring a Beowulf cluster running MPI. All of the nodes in the cluster are PS3s running Yellow Dog Linux 6.2, and the host (server) is a Dell i686 quad-core running Fedora Core 12. The cluster runs Open MPI v1.4.1, configured for a non-homogeneous environment and compiled and installed individually on each node and on the server. The application resides, after building, in an NFS-shared directory on the host; all nodes mount the shared volume and can see all of its files. SSH is configured so the server can log into each node without a password, and vice versa. The built-in firewalls (iptables and ip6tables) are disabled. The server has two Ethernet interfaces: eth1 is used for cluster communication and has a static IP address of 192.168.0.1, while eth2 connects to the outside world on a corporate network and gets a DHCP-assigned IP address.
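For context, the job is launched from the NFS-shared directory with an ordinary hostfile. The hostnames, slot counts, and application name below are just placeholders, not my actual configuration:

    # hostfile (placeholder names)
    server    slots=4
    ps3-node1 slots=2
    ps3-node2 slots=2

    mpirun -np 8 --hostfile hostfile ./master_slave_app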
I have a very simple master/slave framework application in which each slave does a simple computation and returns the result along with its processor name. The master farms out 1024 such tasks to the slaves and, after finalizing, exits. When I run the code locally across multiple cores on either the server or a PS3, it executes and completes as expected. However, when I have mpirun spread the work across the nodes, the processes hang waiting for messages to be passed between the server and the nodes. What I have discovered is that if I unplug the second NIC (the one using DHCP), the job runs fine. I have requested a static IP address from the network admin, but I was curious whether anyone else has run into this when running DHCP? Thanks.

Lee Manko
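P.S. For reference, the master/slave exchange boils down to roughly the following pattern. This is a stripped-down sketch, not my actual source: the real computation is stubbed out, and the tags, counts, and message layout are only illustrative (in the real code the slave also sends its processor name back with the result).

    #include <mpi.h>
    #include <stdio.h>

    #define NUM_TASKS 1024
    #define TAG_WORK  1
    #define TAG_STOP  2

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                        /* master */
            int sent = 0, received = 0, r;
            int stop = -1;

            /* prime every slave with one task (or a stop if none remain) */
            for (r = 1; r < size; r++) {
                if (sent < NUM_TASKS) {
                    MPI_Send(&sent, 1, MPI_INT, r, TAG_WORK, MPI_COMM_WORLD);
                    sent++;
                } else {
                    MPI_Send(&stop, 1, MPI_INT, r, TAG_STOP, MPI_COMM_WORLD);
                }
            }

            /* collect results and hand out the remaining tasks */
            while (received < sent) {
                double result;
                MPI_Status st;

                MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &st);
                received++;

                if (sent < NUM_TASKS) {
                    MPI_Send(&sent, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                             MPI_COMM_WORLD);
                    sent++;
                } else {
                    MPI_Send(&stop, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                             MPI_COMM_WORLD);
                }
            }
        } else {                                /* slave */
            char name[MPI_MAX_PROCESSOR_NAME];
            int len, task;

            MPI_Get_processor_name(name, &len);
            for (;;) {
                MPI_Status st;
                double result;

                MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP)
                    break;
                result = 2.0 * task;            /* stand-in for the real computation */
                printf("%s finished task %d\n", name, task);
                MPI_Send(&result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
            }
        }

        MPI_Finalize();
        return 0;
    }

Run with all ranks on one machine, this completes; spread across the nodes, it hangs in the message exchange unless eth2 is unplugged.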