I'm having trouble running an MPI program on a Linux (CentOS 5.7) cluster. The cluster has 16 nodes with 12 CPU cores per node. Each node has two connections to a switch, eth0 and eth2, with IP addresses assigned as:

  eth0 : 192.168.1.1/16
  eth2 : 192.168.1.101/106

I would like to use eth2 for MPI communication.
I tried to run a program as:

  mpiexec --mca btl_tcp_if_include eth2 --mca btl_tcp_if_exclude lo,eth0 \
          -hostfile hostfile -n 192 ./my_program

The file 'hostfile' has lines such as:

  node101 slots=12
  ...

and the /etc/hosts file has lines such as:

  192.168.1.1    node001
  ...
  192.168.1.101  node101
  ...

The program simply hangs at MPI_Bcast(...) or MPI_Barrier(...). MPI_Init(), MPI_Comm_rank(), and MPI_Comm_size() all return correct results. If I run the program with only eth0 up (ifconfig eth2 down on all nodes, and a different hostfile that contains node001 - node016), it runs just fine.

Any help would be appreciated. Thanks in advance.

-- K. H. Pae