Hi George,

I am afraid the suggestion to use btl_tcp_if_exclude has not helped. I
executed the following command:

*shell$ mpirun --mca btl_tcp_if_exclude mic0,mic1 -app appfile*

<the same output and hang>
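One thought: if my override replaces the default value of
btl_tcp_if_exclude (which, as I understand it, normally excludes the
loopback interface), should I add lo back explicitly? That is:

shell$ mpirun --mca btl_tcp_if_exclude lo,mic0,mic1 -app appfile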
Please let me know if there are options to mpirun (apart from -v) that
produce verbose output, so that I can understand what is happening.
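For instance, is the following the right way to turn up the BTL
verbosity? My assumption is that btl_base_verbose controls the logging
of the TCP connection setup, and that ompi_info can list the available
btl tcp parameters:

shell$ mpirun --mca btl_base_verbose 100 --mca btl_tcp_if_exclude lo,mic0,mic1 -app appfile
shell$ ompi_info --param btl tcp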
On Fri, Apr 24, 2015 at 5:59 PM, George Bosilca <bosi...@icl.utk.edu> wrote:

> Manumachu,
>
> Both nodes have the same IP for their Phi (mic0 and mic1). This is OK as
> long as they don't try to connect to each other using these addresses. A
> simple fix is to prevent OMPI from using the supposedly local mic0 and
> mic1 IP. Add --mca btl_tcp_if_exclude mic0,mic1 to your mpirun command
> and things should start working better.
>
> George.
>
>
> On Apr 24, 2015, at 03:32, Manumachu Reddy <manumachu.re...@gmail.com>
> wrote:
>
> Dear Open MPI Users,
>
> I request your help to resolve a hang in my Open MPI application.
>
> My application hangs in the MPI_Comm_split() operation. The code for
> this simple application is at the end of this email. Broadcast works
> fine.
>
> My experimental setup comprises two RHEL 6.4 Linux nodes. Each node has
> 2 mic cards. Please note that although the mic cards are present, my
> application does not use them.
>
> I have tested with two Open MPI versions (1.6.5 and 1.8.4) and see the
> hang in both. Open MPI is installed using the following command:
>
> ./configure
> --prefix=/home/manumachu/OpenMPI/openmpi-1.8.4/OPENMPI_INSTALL_ICC
> CC="icc -fPIC" CXX="icpc -fPIC"
>
> I have made sure the firewall is turned off, using the following
> commands:
>
> sudo service iptables save
> sudo service iptables stop
> sudo chkconfig iptables off
>
> I have made sure the mic cards are online and healthy; I am able to log
> in to them.
>
> I use an appfile to launch 2 processes on each node.
>
> I have also included the "ifconfig" listing for each node below. Could
> this problem be related to multiple network interfaces (see the
> application output at the end of this email)?
>
> Please let me know if you need further information. I look forward to
> your suggestions.
>
> Best Regards
> Manumachu
>
> *Application*
>
> #include <stdio.h>
> #include <mpi.h>
>
> int main(int argc, char** argv)
> {
>     int me, hostnamelen;
>     char hostname[MPI_MAX_PROCESSOR_NAME];
>
>     MPI_Init(&argc, &argv);
>
>     MPI_Get_processor_name(hostname, &hostnamelen);
>
>     MPI_Comm_rank(MPI_COMM_WORLD, &me);
>     printf("Hostname %s: Me is %d.\n", hostname, me);
>
>     int a;
>     MPI_Bcast(&a, 1, MPI_INT, 0, MPI_COMM_WORLD);
>
>     printf("Hostname %s: Me %d broadcasted.\n", hostname, me);
>
>     MPI_Comm intraNodeComm;
>     int rc = MPI_Comm_split(MPI_COMM_WORLD, me, me, &intraNodeComm);
>
>     if (rc != MPI_SUCCESS)
>     {
>         printf("MAIN: Problems MPI_Comm_split...Exiting...\n");
>         return -1;
>     }
>
>     printf("Hostname %s: Me %d after comm split.\n", hostname, me);
>     MPI_Comm_free(&intraNodeComm);
>     MPI_Finalize();
>
>     return 0;
> }
>
> *Application output*
>
> Hostname server5: Me is 0.
> Hostname server5: Me is 1.
> Hostname server5: Me 1 broadcasted.
> Hostname server5: Me 0 broadcasted.
> [server5][[50702,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
> connect() to 172.31.1.254 failed: Connection refused (111)
> [server5][[50702,1],1][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
> connect() to 172.31.1.254 failed: Connection refused (111)
> Hostname server2: Me is 2.
> Hostname server2: Me 2 broadcasted.
> Hostname server2: Me is 3.
> Hostname server2: Me 3 broadcasted.
>
> *server2 ifconfig*
>
> eth0    Link encap:Ethernet...
>         UP BROADCAST MULTICAST  MTU:1500  Metric:1
>         ...
> eth1    Link encap:Ethernet...
>         UP BROADCAST MULTICAST  MTU:1500  Metric:1
>         ...
> eth2    Link encap:Ethernet...
>         UP BROADCAST MULTICAST  MTU:1500  Metric:1
>         ...
> eth3    Link encap:Ethernet...
>         inet addr:172.17.27.17  Bcast:172.17.27.255  Mask:255.255.255.0
>         inet6 addr: fe80::921b:eff:fe42:a5ba/64 Scope:Link
>         UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>         ...
> lo      Link encap:Local Loopback
>         inet addr:127.0.0.1  Mask:255.0.0.0
>         UP LOOPBACK RUNNING  MTU:65536  Metric:1
>         ...
> mic0    Link encap:Ethernet
>         inet addr:172.31.1.254  Bcast:172.31.1.255  Mask:255.255.255.0
>         ...
> mic1    Link encap:Ethernet...
>         inet addr:172.31.2.254  Bcast:172.31.2.255  Mask:255.255.255.0
>         ...
>
> *server5 ifconfig*
>
> eth0    Link encap:Ethernet...
>         UP BROADCAST MULTICAST  MTU:1500  Metric:1
>         ...
> eth1    Link encap:Ethernet...
>         UP BROADCAST MULTICAST  MTU:1500  Metric:1
>         ...
> eth2    Link encap:Ethernet...
>         UP BROADCAST MULTICAST  MTU:1500  Metric:1
>         ...
> eth3    Link encap:Ethernet...
>         inet addr:172.17.27.14  Bcast:172.17.27.255  Mask:255.255.255.0
>         UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>         ...
> lo      Link encap:Local Loopback
>         inet addr:127.0.0.1  Mask:255.0.0.0
>         ...
> mic0    Link encap:Ethernet...
>         inet addr:172.31.1.254  Bcast:172.31.1.255  Mask:255.255.255.0
>         UP BROADCAST RUNNING  MTU:64512  Metric:1
>         ...
> mic1    Link encap:Ethernet...
>         inet addr:172.31.2.254  Bcast:172.31.2.255  Mask:255.255.255.0
>         ...

--
Best Regards
Ravi
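P.S. A side note on my test code: because the split uses the rank as
both the color and the key, every process ends up in a communicator of
size 1, so intraNodeComm is not actually an intra-node communicator.
Assuming a per-node communicator is what the name is meant to express,
a minimal sketch using MPI-3's MPI_Comm_split_type would look like the
following (this should work with 1.8.4, but not with 1.6.5, which
implements MPI-2.1):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    int me;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);

    /* MPI-3: group the ranks that share a memory domain, i.e. the
       ranks on the same node, into one communicator per node. */
    MPI_Comm intraNodeComm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, me,
                        MPI_INFO_NULL, &intraNodeComm);

    int nodeRank, nodeSize;
    MPI_Comm_rank(intraNodeComm, &nodeRank);
    MPI_Comm_size(intraNodeComm, &nodeSize);
    printf("Global rank %d is rank %d of %d on its node.\n",
           me, nodeRank, nodeSize);

    MPI_Comm_free(&intraNodeComm);
    MPI_Finalize();
    return 0;
}

Either way, the plain MPI_Comm_split in my test program should also
work; the connect() failures in the output suggest the hang is a
connectivity problem rather than a problem in the code.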