This is my first attempt at configuring a Beowulf cluster running MPI.  All
of the nodes in the cluster are PS3s running Yellow Dog Linux 6.2 and the
host (server) is a Dell i686 quad-core running Fedora Core 12.  The cluster
is running Open MPI v1.4.1, configured for a heterogeneous setup and
compiled and installed individually on each node and on the server.  I have
an NFS-shared directory on the host where the application resides after
building.  All nodes have access to the shared volume and can see all files
in it.  SSH is configured so that the server can log into each node without
a password and vice versa.  The built-in firewalls (iptables and ip6tables)
are disabled.  The server has two Ethernet interfaces.  The first, eth1, is
used for cluster communications and has a static IP address of 192.168.0.1.
The second, eth2, is used to communicate with the outside world and is
connected to a corporate network, getting a DHCP-assigned IP address.
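
For reference, I launch the job over eth1 with a hostfile roughly like the
one below (the node addresses, slot counts, and executable name here are
just illustrative, not my exact files):

    # hostfile (illustrative)
    192.168.0.1    slots=4    # Dell quad-core server
    192.168.0.10   slots=2    # PS3 node
    192.168.0.11   slots=2    # PS3 node

    mpirun --hostfile hostfile -np 8 ./mpi_framework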


I have a very simple master/slave framework application where the slave does
a simple computation and returns the result along with the processor name.
The master farms out 1024 such tasks to the slaves and, after finalizing,
exits.
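
The structure is roughly the sketch below (the tags, the round-robin
scheduling, and the square-the-integer work are placeholders standing in
for the real computation, not the actual application code):

    /* Rough sketch of the master/slave pattern, not the real application. */
    #include <mpi.h>
    #include <stdio.h>

    #define NUM_TASKS 1024
    #define TAG_WORK  1
    #define TAG_STOP  2

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0 && size > 1) {
            /* Master: hand out NUM_TASKS tasks round-robin and collect the
             * result and the processor name for each one. */
            int task, result, dest;
            char name[MPI_MAX_PROCESSOR_NAME];
            MPI_Status status;
            for (task = 0; task < NUM_TASKS; task++) {
                dest = 1 + (task % (size - 1));
                MPI_Send(&task, 1, MPI_INT, dest, TAG_WORK, MPI_COMM_WORLD);
                MPI_Recv(&result, 1, MPI_INT, dest, TAG_WORK, MPI_COMM_WORLD,
                         &status);
                MPI_Recv(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, dest,
                         TAG_WORK, MPI_COMM_WORLD, &status);
                printf("task %d -> %d on %s\n", task, result, name);
            }
            /* Tell the slaves to shut down (payload is ignored). */
            for (dest = 1; dest < size; dest++)
                MPI_Send(&task, 1, MPI_INT, dest, TAG_STOP, MPI_COMM_WORLD);
        } else if (rank != 0) {
            /* Slave: do a trivial computation and send back the result and
             * this node's processor name. */
            int task, result, len;
            char name[MPI_MAX_PROCESSOR_NAME];
            MPI_Status status;
            MPI_Get_processor_name(name, &len);
            while (1) {
                MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD,
                         &status);
                if (status.MPI_TAG == TAG_STOP)
                    break;
                result = task * task;        /* stand-in for the real work */
                MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
                MPI_Send(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, 0, TAG_WORK,
                         MPI_COMM_WORLD);
            }
        }

        MPI_Finalize();
        return 0;
    }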



When I run the code locally on multiple cores on either the server or the
PS3, it executes and completes as expected.  However, when I have mpirun
spread the work across the nodes, the process hangs waiting for messages to
be passed between the server and the nodes.  What I have discovered is that
if I unplug the second NIC, the one using DHCP, the run completes fine.
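
One thing I plan to try in the meantime is pinning Open MPI's TCP traffic
to the cluster interface.  If I understand the MCA parameters correctly,
something along these lines should restrict both the byte-transfer layer
and the out-of-band channel to eth1, though I have not yet verified this on
v1.4.1, so treat it as a guess:

    mpirun --mca btl_tcp_if_include eth1 --mca oob_tcp_if_include eth1 \
           --hostfile hostfile -np 8 ./mpi_framework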


I have requested a static IP address from the network admin, but I am
curious: has anyone else run into this when the second interface gets its
address via DHCP?


Thanks.



Lee Manko
