Manumachu,

Both nodes have the same IP addresses for their Phis (mic0 and mic1). This is OK as long as they never try to connect to each other using these addresses. A simple fix is to prevent OMPI from using the supposedly local mic0 and mic1 IPs: add --mca btl_tcp_if_exclude mic0,mic1 to your mpirun command and things should start working better.
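For reference, a sketch of what the launch could look like, assuming an appfile named "my_appfile" (a placeholder for whatever you already use). Note that setting btl_tcp_if_exclude replaces the default exclude list, which normally covers the loopback interface, so it is safest to list lo as well:

    mpirun --mca btl_tcp_if_exclude lo,mic0,mic1 --app my_appfile

Alternatively, since both nodes can apparently reach each other over eth3 (the 172.17.27.x addresses in your ifconfig output), whitelisting that single interface with --mca btl_tcp_if_include eth3 should work too.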
  George.

> On Apr 24, 2015, at 03:32, Manumachu Reddy <manumachu.re...@gmail.com> wrote:
>
> Dear OpenMPI Users,
>
> I request your help to resolve a hang in my OpenMPI application.
>
> My OpenMPI application hangs in the MPI_Comm_split() operation. The code
> for this simple application is at the end of this email. Broadcast works
> fine.
>
> My experimental setup comprises two RHEL6.4 Linux nodes. Each node has 2
> mic cards. Please note that although there are mic cards, I do not use
> mic cards in my OpenMPI application.
>
> I have tested with two OpenMPI versions (1.6.5, 1.8.4). I see the hang in
> both versions. OpenMPI is installed using the following command:
>
> ./configure --prefix=/home/manumachu/OpenMPI/openmpi-1.8.4/OPENMPI_INSTALL_ICC CC="icc -fPIC" CXX="icpc -fPIC"
>
> I have made sure I have turned off the firewall using the following
> commands:
>
> sudo service iptables save
> sudo service iptables stop
> sudo chkconfig iptables off
>
> I made sure the mic cards are online and healthy. I am able to log in to
> the mic cards.
>
> I use an appfile to launch 2 processes on each node.
>
> I have also attached the "ifconfig" list for each node. Could this problem
> be related to multiple network interfaces (from the application output
> also shown at the end of the email)?
>
> Please let me know if you need further information; I look forward to your
> suggestions.
>
> Best Regards
> Manumachu
>
> Application
>
> #include <stdio.h>
> #include <mpi.h>
>
> int main(int argc, char** argv)
> {
>     int me, hostnamelen;
>     char hostname[MPI_MAX_PROCESSOR_NAME];
>
>     MPI_Init(&argc, &argv);
>
>     MPI_Get_processor_name(hostname, &hostnamelen);
>
>     MPI_Comm_rank(MPI_COMM_WORLD, &me);
>     printf("Hostname %s: Me is %d.\n", hostname, me);
>
>     int a;
>     MPI_Bcast(&a, 1, MPI_INT, 0, MPI_COMM_WORLD);
>
>     printf("Hostname %s: Me %d broadcasted.\n", hostname, me);
>
>     MPI_Comm intraNodeComm;
>     int rc = MPI_Comm_split(
>         MPI_COMM_WORLD,
>         me,
>         me,
>         &intraNodeComm
>     );
>
>     if (rc != MPI_SUCCESS)
>     {
>         printf("MAIN: Problems MPI_Comm_split...Exiting...\n");
>         return -1;
>     }
>
>     printf("Hostname %s: Me %d after comm split.\n", hostname, me);
>     MPI_Comm_free(&intraNodeComm);
>     MPI_Finalize();
>
>     return 0;
> }
>
> Application output
>
> Hostname server5: Me is 0.
> Hostname server5: Me is 1.
> Hostname server5: Me 1 broadcasted.
> Hostname server5: Me 0 broadcasted.
> [server5][[50702,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect] connect() to 172.31.1.254 failed: Connection refused (111)
> [server5][[50702,1],1][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect] connect() to 172.31.1.254 failed: Connection refused (111)
> Hostname server2: Me is 2.
> Hostname server2: Me 2 broadcasted.
> Hostname server2: Me is 3.
> Hostname server2: Me 3 broadcasted.
>
> server2 ifconfig
>
> eth0      Link encap:Ethernet...
>           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>           ...
> eth1      Link encap:Ethernet...
>           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>           ...
> eth2      Link encap:Ethernet...
>           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>           ...
> eth3      Link encap:Ethernet...
>           inet addr:172.17.27.17  Bcast:172.17.27.255  Mask:255.255.255.0
>           inet6 addr: fe80::921b:eff:fe42:a5ba/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           ...
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           UP LOOPBACK RUNNING  MTU:65536  Metric:1
>           ...
> mic0      Link encap:Ethernet
>           inet addr:172.31.1.254  Bcast:172.31.1.255  Mask:255.255.255.0
>           ...
> mic1      Link encap:Ethernet...
>           inet addr:172.31.2.254  Bcast:172.31.2.255  Mask:255.255.255.0
>           ...
>
> server5 ifconfig
>
> eth0      Link encap:Ethernet...
>           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>           ...
> eth1      Link encap:Ethernet...
>           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>           ...
> eth2      Link encap:Ethernet...
>           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>           ...
> eth3      Link encap:Ethernet...
>           inet addr:172.17.27.14  Bcast:172.17.27.255  Mask:255.255.255.0
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           ...
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           ...
> mic0      Link encap:Ethernet...
>           inet addr:172.31.1.254  Bcast:172.31.1.255  Mask:255.255.255.0
>           UP BROADCAST RUNNING  MTU:64512  Metric:1
>           ...
> mic1      Link encap:Ethernet...
>           inet addr:172.31.2.254  Bcast:172.31.2.255  Mask:255.255.255.0
>           ...
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2015/04/26780.php