With the arguments I sent you, the "connection refused" error should have disappeared. Let's try to force all traffic over the TCP interface eth3. Try adding the following flags to your mpirun:

  --mca pml ob1 --mca btl tcp,sm,self --mca btl_tcp_if_include eth3
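Applied to the appfile launch from your earlier command, the full invocation would look roughly like this (a sketch, assuming the same appfile as before):

  shell$ mpirun --mca pml ob1 --mca btl tcp,sm,self --mca btl_tcp_if_include eth3 -app appfile

If the run still hangs, adding something like --mca btl_base_verbose 30 should make the TCP BTL report which interfaces and peer addresses it is trying, which may also answer your question about getting more verbose output than -v provides.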
George.

On Sun, Apr 26, 2015 at 8:04 AM, Manumachu Reddy <manumachu.re...@gmail.com> wrote:
>
> Hi George,
>
> I am afraid the suggestion to use btl_tcp_if_exclude has not helped. I
> executed the following command:
>
> *shell$ mpirun --mca btl_tcp_if_exclude mic0,mic1 -app appfile*
> <the same output and hang>
>
> Please let me know if there are options to mpirun (apart from -v) that
> give verbose output, to help understand what is happening.
>
>
> On Fri, Apr 24, 2015 at 5:59 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
>> Manumachu,
>>
>> Both nodes have the same IP address for their Phi cards (mic0 and mic1).
>> This is OK as long as they don't try to connect to each other using these
>> addresses. A simple fix is to prevent OMPI from using the supposedly local
>> mic0 and mic1 IP addresses. Add --mca btl_tcp_if_exclude mic0,mic1 to your
>> mpirun command and things should start working better.
>>
>> George.
>>
>>
>> On Apr 24, 2015, at 03:32, Manumachu Reddy <manumachu.re...@gmail.com> wrote:
>>
>> Dear OpenMPI Users,
>>
>> I request your help to resolve a hang in my OpenMPI application.
>>
>> My OpenMPI application hangs in the MPI_Comm_split() operation. The code
>> for this simple application is at the end of this email. Broadcast works
>> fine.
>>
>> My experimental setup comprises two RHEL 6.4 Linux nodes. Each node has
>> two mic cards. Please note that although the mic cards are present, my
>> OpenMPI application does not use them.
>>
>> I have tested two OpenMPI versions (1.6.5 and 1.8.4) and see the hang in
>> both. OpenMPI was installed using the following command:
>>
>> ./configure --prefix=/home/manumachu/OpenMPI/openmpi-1.8.4/OPENMPI_INSTALL_ICC CC="icc -fPIC" CXX="icpc -fPIC"
>>
>> I have made sure the firewall is turned off, using the following commands:
>>
>> sudo service iptables save
>> sudo service iptables stop
>> sudo chkconfig iptables off
>>
>> I have made sure the mic cards are online and healthy, and I am able to
>> log in to them.
>>
>> I use an appfile to launch 2 processes on each node.
>>
>> I have also included the "ifconfig" output for each node below. Could this
>> problem be related to the multiple network interfaces (see also the
>> application output at the end of this email)?
>>
>> Please let me know if you need further information. I look forward to
>> your suggestions.
>>
>> Best Regards
>> Manumachu
>>
>> *Application*
>>
>> #include <stdio.h>
>> #include <mpi.h>
>>
>> int main(int argc, char** argv)
>> {
>>     int me, hostnamelen;
>>     char hostname[MPI_MAX_PROCESSOR_NAME];
>>
>>     MPI_Init(&argc, &argv);
>>
>>     MPI_Get_processor_name(hostname, &hostnamelen);
>>
>>     MPI_Comm_rank(MPI_COMM_WORLD, &me);
>>     printf("Hostname %s: Me is %d.\n", hostname, me);
>>
>>     int a;
>>     MPI_Bcast(&a, 1, MPI_INT, 0, MPI_COMM_WORLD);
>>
>>     printf("Hostname %s: Me %d broadcasted.\n", hostname, me);
>>
>>     MPI_Comm intraNodeComm;
>>     int rc = MPI_Comm_split(
>>         MPI_COMM_WORLD,
>>         me,            /* color */
>>         me,            /* key */
>>         &intraNodeComm
>>     );
>>
>>     if (rc != MPI_SUCCESS)
>>     {
>>         printf("MAIN: Problems MPI_Comm_split...Exiting...\n");
>>         return -1;
>>     }
>>
>>     printf("Hostname %s: Me %d after comm split.\n", hostname, me);
>>     MPI_Comm_free(&intraNodeComm);
>>     MPI_Finalize();
>>
>>     return 0;
>> }
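>> (Note: because each rank passes its own rank as the color, this split
>> creates one single-process communicator per rank rather than a per-node
>> communicator. A genuinely intra-node split would look something like the
>> sketch below; this is an illustration only, not the code above, and it
>> assumes MPI-3's MPI_Comm_split_type, which Open MPI 1.8.x provides but
>> 1.6.x does not.)
>>
>> MPI_Comm nodeComm;
>> /* Group the ranks that share a memory domain, i.e. the same node;
>>    key = me preserves the MPI_COMM_WORLD rank order within each node. */
>> MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, me,
>>                     MPI_INFO_NULL, &nodeComm);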
>> *Application output*
>>
>> Hostname server5: Me is 0.
>> Hostname server5: Me is 1.
>> Hostname server5: Me 1 broadcasted.
>> Hostname server5: Me 0 broadcasted.
>> [server5][[50702,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect] connect() to 172.31.1.254 failed: Connection refused (111)
>> [server5][[50702,1],1][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect] connect() to 172.31.1.254 failed: Connection refused (111)
>> Hostname server2: Me is 2.
>> Hostname server2: Me 2 broadcasted.
>> Hostname server2: Me is 3.
>> Hostname server2: Me 3 broadcasted.
>>
>> *server2 ifconfig*
>>
>> eth0  Link encap:Ethernet...
>>       UP BROADCAST MULTICAST MTU:1500 Metric:1
>>       ...
>> eth1  Link encap:Ethernet...
>>       UP BROADCAST MULTICAST MTU:1500 Metric:1
>>       ...
>> eth2  Link encap:Ethernet...
>>       UP BROADCAST MULTICAST MTU:1500 Metric:1
>>       ...
>> eth3  Link encap:Ethernet...
>>       inet addr:172.17.27.17 Bcast:172.17.27.255 Mask:255.255.255.0
>>       inet6 addr: fe80::921b:eff:fe42:a5ba/64 Scope:Link
>>       UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>       ...
>> lo    Link encap:Local Loopback
>>       inet addr:127.0.0.1 Mask:255.0.0.0
>>       UP LOOPBACK RUNNING MTU:65536 Metric:1
>>       ...
>> mic0  Link encap:Ethernet
>>       inet addr:172.31.1.254 Bcast:172.31.1.255 Mask:255.255.255.0
>>       ...
>> mic1  Link encap:Ethernet...
>>       inet addr:172.31.2.254 Bcast:172.31.2.255 Mask:255.255.255.0
>>       ...
>>
>> *server5 ifconfig*
>>
>> eth0  Link encap:Ethernet...
>>       UP BROADCAST MULTICAST MTU:1500 Metric:1
>>       ...
>> eth1  Link encap:Ethernet...
>>       UP BROADCAST MULTICAST MTU:1500 Metric:1
>>       ...
>> eth2  Link encap:Ethernet...
>>       UP BROADCAST MULTICAST MTU:1500 Metric:1
>>       ...
>> eth3  Link encap:Ethernet...
>>       inet addr:172.17.27.14 Bcast:172.17.27.255 Mask:255.255.255.0
>>       UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>       ...
>> lo    Link encap:Local Loopback
>>       inet addr:127.0.0.1 Mask:255.0.0.0
>>       ...
>> mic0  Link encap:Ethernet...
>>       inet addr:172.31.1.254 Bcast:172.31.1.255 Mask:255.255.255.0
>>       UP BROADCAST RUNNING MTU:64512 Metric:1
>>       ...
>> mic1  Link encap:Ethernet...
>>       inet addr:172.31.2.254 Bcast:172.31.2.255 Mask:255.255.255.0
>>       ...
>
>
> --
> Best Regards
> Ravi