Hi George,

Sorry for the delay in writing to you. Your latest suggestion has worked
admirably well. Thanks a lot for your help.
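For the archives, the combination that fixed it for me, assuming the same
appfile as in my earlier messages, was along these lines:

    shell$ mpirun --mca pml ob1 --mca btl tcp,sm,self \
           --mca btl_tcp_if_include eth3 -app appfile

(For anyone debugging similar TCP connection issues, the BTL's connection
attempts can also be traced with an MCA verbosity parameter such as
--mca btl_base_verbose 100, though in the end that was not needed here.)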
On Sun, Apr 26, 2015 at 9:32 PM, George Bosilca <bosi...@icl.utk.edu> wrote:

> With the arguments I sent you, the error about connection refused should
> have disappeared. Let's try to force all traffic over the first TCP
> interface, eth3. Add the following flags to your mpirun:
>
>     --mca pml ob1 --mca btl tcp,sm,self --mca btl_tcp_if_include eth3
>
> George.
>
> On Sun, Apr 26, 2015 at 8:04 AM, Manumachu Reddy <manumachu.re...@gmail.com> wrote:
>
>> Hi George,
>>
>> I am afraid the suggestion to use btl_tcp_if_exclude has not helped. I
>> executed the following command:
>>
>>     shell$ mpirun --mca btl_tcp_if_exclude mic0,mic1 -app appfile
>>     <the same output and hang>
>>
>> Please let me know if there are options to mpirun (apart from -v) that
>> produce verbose output, to help understand what is happening.
>>
>> On Fri, Apr 24, 2015 at 5:59 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>
>>> Manumachu,
>>>
>>> Both nodes have the same IPs for their Phi cards (mic0 and mic1). This
>>> is OK as long as they don't try to connect to each other using these
>>> addresses. A simple fix is to prevent OMPI from using the supposedly
>>> local mic0 and mic1 IPs. Add --mca btl_tcp_if_exclude mic0,mic1 to
>>> your mpirun command and things should start working better.
>>>
>>> George.
>>>
>>> On Apr 24, 2015, at 03:32, Manumachu Reddy <manumachu.re...@gmail.com> wrote:
>>>
>>> Dear Open MPI Users,
>>>
>>> I request your help to resolve a hang in my Open MPI application.
>>>
>>> The application hangs in the MPI_Comm_split() operation; the broadcast
>>> that precedes it completes fine. The code for this simple application
>>> is at the end of this email.
>>>
>>> My experimental setup comprises two RHEL 6.4 Linux nodes, each with
>>> two mic cards. Please note that although the mic cards are present, my
>>> application does not use them.
>>>
>>> I have tested two Open MPI versions (1.6.5 and 1.8.4) and see the hang
>>> in both. Open MPI was installed with the following command:
>>>
>>>     ./configure --prefix=/home/manumachu/OpenMPI/openmpi-1.8.4/OPENMPI_INSTALL_ICC CC="icc -fPIC" CXX="icpc -fPIC"
>>>
>>> I have made sure the firewall is turned off, using the following
>>> commands:
>>>
>>>     sudo service iptables save
>>>     sudo service iptables stop
>>>     sudo chkconfig iptables off
>>>
>>> I made sure the mic cards are online and healthy; I am able to log in
>>> to them.
>>>
>>> I use an appfile to launch two processes on each node, as sketched
>>> below.
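>>> The appfile has roughly this form, one mpirun app context per line
>>> (the binary name ./app is a placeholder):
>>>
>>>     -np 2 -host server5 ./app
>>>     -np 2 -host server2 ./app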
>>> I have also attached the "ifconfig" listing for each node. Could this
>>> problem be related to multiple network interfaces (see also the
>>> application output at the end of this email)?
>>>
>>> Please let me know if you need further information; I look forward to
>>> your suggestions.
>>>
>>> Best Regards
>>> Manumachu
>>>
>>> *Application*
>>>
>>> #include <stdio.h>
>>> #include <mpi.h>
>>>
>>> int main(int argc, char** argv)
>>> {
>>>     int me, hostnamelen;
>>>     char hostname[MPI_MAX_PROCESSOR_NAME];
>>>
>>>     MPI_Init(&argc, &argv);
>>>
>>>     MPI_Get_processor_name(hostname, &hostnamelen);
>>>
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &me);
>>>     printf("Hostname %s: Me is %d.\n", hostname, me);
>>>
>>>     /* The broadcast completes on all ranks. */
>>>     int a;
>>>     MPI_Bcast(&a, 1, MPI_INT, 0, MPI_COMM_WORLD);
>>>
>>>     printf("Hostname %s: Me %d broadcasted.\n", hostname, me);
>>>
>>>     /* The hang occurs here. */
>>>     MPI_Comm intraNodeComm;
>>>     int rc = MPI_Comm_split(MPI_COMM_WORLD, me, me, &intraNodeComm);
>>>
>>>     if (rc != MPI_SUCCESS)
>>>     {
>>>         printf("MAIN: Problems MPI_Comm_split...Exiting...\n");
>>>         return -1;
>>>     }
>>>
>>>     printf("Hostname %s: Me %d after comm split.\n", hostname, me);
>>>
>>>     MPI_Comm_free(&intraNodeComm);
>>>     MPI_Finalize();
>>>
>>>     return 0;
>>> }
>>>
>>> *Application output*
>>>
>>> Hostname server5: Me is 0.
>>> Hostname server5: Me is 1.
>>> Hostname server5: Me 1 broadcasted.
>>> Hostname server5: Me 0 broadcasted.
>>> [server5][[50702,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect] connect() to 172.31.1.254 failed: Connection refused (111)
>>> [server5][[50702,1],1][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect] connect() to 172.31.1.254 failed: Connection refused (111)
>>> Hostname server2: Me is 2.
>>> Hostname server2: Me 2 broadcasted.
>>> Hostname server2: Me is 3.
>>> Hostname server2: Me 3 broadcasted.
>>>
>>> *server2 ifconfig*
>>>
>>> eth0    Link encap:Ethernet...
>>>         UP BROADCAST MULTICAST MTU:1500 Metric:1
>>>         ...
>>> eth1    Link encap:Ethernet...
>>>         UP BROADCAST MULTICAST MTU:1500 Metric:1
>>>         ...
>>> eth2    Link encap:Ethernet...
>>>         UP BROADCAST MULTICAST MTU:1500 Metric:1
>>>         ...
>>> eth3    Link encap:Ethernet...
>>>         inet addr:172.17.27.17 Bcast:172.17.27.255 Mask:255.255.255.0
>>>         inet6 addr: fe80::921b:eff:fe42:a5ba/64 Scope:Link
>>>         UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>>         ...
>>> lo      Link encap:Local Loopback
>>>         inet addr:127.0.0.1 Mask:255.0.0.0
>>>         UP LOOPBACK RUNNING MTU:65536 Metric:1
>>>         ...
>>> mic0    Link encap:Ethernet
>>>         inet addr:172.31.1.254 Bcast:172.31.1.255 Mask:255.255.255.0
>>>         ...
>>> mic1    Link encap:Ethernet...
>>>         inet addr:172.31.2.254 Bcast:172.31.2.255 Mask:255.255.255.0
>>>         ...
>>>
>>> *server5 ifconfig*
>>>
>>> eth0    Link encap:Ethernet...
>>>         UP BROADCAST MULTICAST MTU:1500 Metric:1
>>>         ...
>>> eth1    Link encap:Ethernet...
>>>         UP BROADCAST MULTICAST MTU:1500 Metric:1
>>>         ...
>>> eth2    Link encap:Ethernet...
>>>         UP BROADCAST MULTICAST MTU:1500 Metric:1
>>>         ...
>>> eth3    Link encap:Ethernet...
>>>         inet addr:172.17.27.14 Bcast:172.17.27.255 Mask:255.255.255.0
>>>         UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>>         ...
>>> lo      Link encap:Local Loopback
>>>         inet addr:127.0.0.1 Mask:255.0.0.0
>>>         ...
>>> mic0    Link encap:Ethernet...
>>>         inet addr:172.31.1.254 Bcast:172.31.1.255 Mask:255.255.255.0
>>>         UP BROADCAST RUNNING MTU:64512 Metric:1
>>>         ...
>>> mic1    Link encap:Ethernet...
>>>         inet addr:172.31.2.254 Bcast:172.31.2.255 Mask:255.255.255.0
>>>         ...
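>>> To reproduce, the program is built with Open MPI's mpicc compiler
>>> wrapper and launched with the appfile, roughly as follows (app.c is a
>>> placeholder name for the source above):
>>>
>>>     shell$ mpicc app.c -o app
>>>     shell$ mpirun -app appfile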
>>
>> --
>> Best Regards
>> Ravi

--
Best Regards
Ravi