Dear Open MPI users,

I would appreciate your help in resolving a hang in my Open MPI application.
My application hangs in the MPI_Comm_split() operation; the broadcast that precedes it completes fine. The code for this small test program is at the end of this email.

My experimental setup comprises two RHEL 6.4 Linux nodes. Each node has two MIC cards; note that although the MIC cards are present, my application does not use them.

I have tested with two Open MPI versions (1.6.5 and 1.8.4) and see the hang in both. Open MPI was installed using the following command:

./configure --prefix=/home/manumachu/OpenMPI/openmpi-1.8.4/OPENMPI_INSTALL_ICC CC="icc -fPIC" CXX="icpc -fPIC"

I have made sure the firewall is turned off using the following commands:

sudo service iptables save
sudo service iptables stop
sudo chkconfig iptables off

I have also verified that the MIC cards are online and healthy, and I am able to log in to them.

I use an appfile to launch 2 processes on each node; a sketch of the appfile is included at the very end of this email, after the ifconfig listings.

The "ifconfig" output for each node is included at the end of this email as well. Could this problem be related to the multiple network interfaces? I notice that the mic0 interface has the same address (172.31.1.254) on both nodes, and the application output below shows failed connect() attempts to exactly that address. For reference, after the ifconfig listings I have also sketched the kind of mpirun options I could use to restrict the TCP interfaces, in case that is relevant.

Please let me know if you need further information. I look forward to your suggestions.

Best regards,
Manumachu

*Application*

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    int me, hostnamelen;
    char hostname[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Get_processor_name(hostname, &hostnamelen);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    printf("Hostname %s: Me is %d.\n", hostname, me);

    int a;
    MPI_Bcast(&a, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("Hostname %s: Me %d broadcasted.\n", hostname, me);

    MPI_Comm intraNodeComm;
    int rc = MPI_Comm_split(MPI_COMM_WORLD, me, me, &intraNodeComm);
    if (rc != MPI_SUCCESS)
    {
        printf("MAIN: Problems MPI_Comm_split...Exiting...\n");
        return -1;
    }
    printf("Hostname %s: Me %d after comm split.\n", hostname, me);

    MPI_Comm_free(&intraNodeComm);
    MPI_Finalize();
    return 0;
}

*Application output*

Hostname server5: Me is 0.
Hostname server5: Me is 1.
Hostname server5: Me 1 broadcasted.
Hostname server5: Me 0 broadcasted.
[server5][[50702,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect] connect() to 172.31.1.254 failed: Connection refused (111)
[server5][[50702,1],1][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect] connect() to 172.31.1.254 failed: Connection refused (111)
Hostname server2: Me is 2.
Hostname server2: Me 2 broadcasted.
Hostname server2: Me is 3.
Hostname server2: Me 3 broadcasted.

*server2 ifconfig*

eth0  Link encap:Ethernet...
      UP BROADCAST MULTICAST MTU:1500 Metric:1
      ...
eth1  Link encap:Ethernet...
      UP BROADCAST MULTICAST MTU:1500 Metric:1
      ...
eth2  Link encap:Ethernet...
      UP BROADCAST MULTICAST MTU:1500 Metric:1
      ...
eth3  Link encap:Ethernet...
      inet addr:172.17.27.17 Bcast:172.17.27.255 Mask:255.255.255.0
      inet6 addr: fe80::921b:eff:fe42:a5ba/64 Scope:Link
      UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
      ...
lo    Link encap:Local Loopback
      inet addr:127.0.0.1 Mask:255.0.0.0
      UP LOOPBACK RUNNING MTU:65536 Metric:1
      ...
mic0  Link encap:Ethernet
      inet addr:172.31.1.254 Bcast:172.31.1.255 Mask:255.255.255.0
      ...
mic1  Link encap:Ethernet...
      inet addr:172.31.2.254 Bcast:172.31.2.255 Mask:255.255.255.0
      ...

*server5 ifconfig*

eth0  Link encap:Ethernet...
      UP BROADCAST MULTICAST MTU:1500 Metric:1
      ...
eth1  Link encap:Ethernet...
      UP BROADCAST MULTICAST MTU:1500 Metric:1
      ...
eth2  Link encap:Ethernet...
      UP BROADCAST MULTICAST MTU:1500 Metric:1
      ...
eth3  Link encap:Ethernet...
      inet addr:172.17.27.14 Bcast:172.17.27.255 Mask:255.255.255.0
      UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
      ...
lo    Link encap:Local Loopback
      inet addr:127.0.0.1 Mask:255.0.0.0
      ...
mic0  Link encap:Ethernet...
      inet addr:172.31.1.254 Bcast:172.31.1.255 Mask:255.255.255.0
      UP BROADCAST RUNNING MTU:64512 Metric:1
      ...
mic1  Link encap:Ethernet...
      inet addr:172.31.2.254 Bcast:172.31.2.255 Mask:255.255.255.0
      ...
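
*Appfile (sketch)*

This is only a sketch of the structure of the appfile I use, not the literal file; the executable name comm_split_test is a placeholder for my actual binary. Each line is one application context, and the file is passed to mpirun with --app:

# placeholder appfile: 2 processes on each node
-np 2 -host server5 ./comm_split_test
-np 2 -host server2 ./comm_split_test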
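
*Possible interface restriction (sketch)*

In case the multiple interfaces are indeed the problem: my understanding is that the TCP BTL can be limited to specific interfaces via MCA parameters, so one thing I could try is something along these lines (the appfile name is a placeholder, and eth3 is the interface carrying the 172.17.27.x addresses that connect the two nodes). I have not verified that this is the right fix, which is why I am asking here first:

# restrict the TCP BTL to the inter-node Ethernet interface
mpirun --mca btl_tcp_if_include eth3 --app my_appfile

# or, alternatively, exclude the loopback and MIC interfaces
mpirun --mca btl_tcp_if_exclude lo,mic0,mic1 --app my_appfile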