I'm having trouble running an MPI program on a Linux (CentOS 5.7) cluster.
The cluster has 16 nodes with 12 CPU cores per node.
Each node has two connections to a switch, eth0 and eth2.
The IP addresses of the nodes are set as:
eth0 : 192.168.1.1/16
eth2 : 192.168.1.101/16
I would like to use eth2 for MPI communications.

I tried to run the program as:

mpiexec --mca btl_tcp_if_include eth2 --mca btl_tcp_if_exclude lo,eth0 \
        -hostfile hostfile -n 192 ./my_program

The file 'hostfile' has lines such as:
node101 slots=12
...

And the /etc/hosts file has lines such as:
192.168.1.1 node001
...
192.168.1.101 node101
...

But the program simply hangs at MPI_Bcast(...) or MPI_Barrier(...).
MPI_Init(), MPI_Comm_rank(), and MPI_Comm_size() all return correct results.

If the program is run with only eth0 up (ifconfig eth2 down on all nodes,
with another hostfile that lists node001 - node016), it runs just fine.
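I can rerun with more debugging if that helps. If I have the flags right, something like the following should make the TCP BTL log which addresses it tries to connect over, and `ip route get` on each node should show which interface the kernel actually routes a peer's eth2 address through:

```shell
# rerun with TCP BTL selection/connection logging (Open MPI)
mpiexec --mca btl tcp,self \
        --mca btl_tcp_if_include eth2 \
        --mca btl_base_verbose 30 \
        -hostfile hostfile -n 192 ./my_program 2>&1 | tee btl.log

# on each node: which interface does the kernel use to reach
# another node's eth2 address?
ip route get 192.168.1.101
```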

Any help would be appreciated.
Thanks in advance.

-- K. H. Pae
