If any communication will be between two mics on the same node, or between a mic and its host, I suggest using the scif btl instead of tcp. You will see a factor of 10 or more improvement in latency by using the scif interface.
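For instance, a minimal sketch of that selection (assuming an Open MPI 1.8.x build that includes the scif BTL, and reusing the appfile from the thread below; traffic between the two hosts still needs the tcp BTL over eth3):

  shell$ mpirun --mca btl scif,sm,self,tcp --mca btl_tcp_if_include eth3 -app appfile

With this, on-node host-to-mic and mic-to-mic messages should go over scif, while messages between the two hosts stay on eth3.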
-Nathan

On Tue, May 05, 2015 at 10:39:47AM +0530, Manumachu Reddy wrote:
> Hi George,
>
> Sorry for the delay in writing to you.
>
> Your latest suggestion has worked admirably well.
>
> Thanks a lot for your help.
>
> On Sun, Apr 26, 2015 at 9:32 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
> With the arguments I sent you, the error about the refused connection should
> have disappeared. Let's try to force all traffic over the first TCP
> interface, eth3. Add the following flags to your mpirun:
>
>   --mca pml ob1 --mca btl tcp,sm,self --mca btl_tcp_if_include eth3
>
> George.
>
> On Sun, Apr 26, 2015 at 8:04 AM, Manumachu Reddy <manumachu.re...@gmail.com> wrote:
>
> Hi George,
>
> I am afraid the suggestion to use btl_tcp_if_exclude has not helped.
> I executed the following command:
>
>   shell$ mpirun --mca btl_tcp_if_exclude mic0,mic1 -app appfile
>   <the same output and hang>
>
> Please let me know if there are options to mpirun (apart from -v) to
> get verbose output to understand what is happening.
>
> On Fri, Apr 24, 2015 at 5:59 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
> Manumachu,
>
> Both nodes have the same IP addresses for their Phi interfaces (mic0 and
> mic1). This is OK as long as they don't try to connect to each other using
> these addresses. A simple fix is to prevent OMPI from using the supposedly
> local mic0 and mic1 IPs. Add --mca btl_tcp_if_exclude mic0,mic1 to
> your mpirun command and things should start working better.
>
> George.
>
> On Apr 24, 2015, at 03:32, Manumachu Reddy <manumachu.re...@gmail.com> wrote:
>
> Dear Open MPI users,
>
> I request your help to resolve a hang in my Open MPI application.
>
> My application hangs in the MPI_Comm_split() operation; the broadcast
> before it works fine. The code for this simple application is at the end
> of this email.
>
> My experimental setup comprises two RHEL 6.4 Linux nodes. Each node has
> two mic cards. Please note that although the mic cards are present, my
> application does not use them.
>
> I have tested two Open MPI versions (1.6.5 and 1.8.4) and see the hang in
> both. Open MPI was installed using the following command:
>
>   ./configure --prefix=/home/manumachu/OpenMPI/openmpi-1.8.4/OPENMPI_INSTALL_ICC CC="icc -fPIC" CXX="icpc -fPIC"
>
> I have made sure the firewall is turned off, using the following commands:
>
>   sudo service iptables save
>   sudo service iptables stop
>   sudo chkconfig iptables off
>
> I made sure the mic cards are online and healthy; I am able to log in to
> them.
>
> I use an appfile to launch 2 processes on each node.
>
> I have also attached the "ifconfig" output for each node. Could this
> problem be related to the multiple network interfaces (see the application
> output, also shown at the end of this email)?
>
> Please let me know if you need further information. I look forward to
> your suggestions.
> Best Regards
> Manumachu
>
> Application
>
>   #include <stdio.h>
>   #include <mpi.h>
>
>   int main(int argc, char** argv)
>   {
>       int me, hostnamelen;
>       char hostname[MPI_MAX_PROCESSOR_NAME];
>
>       MPI_Init(&argc, &argv);
>
>       MPI_Get_processor_name(hostname, &hostnamelen);
>
>       MPI_Comm_rank(MPI_COMM_WORLD, &me);
>       printf("Hostname %s: Me is %d.\n", hostname, me);
>
>       int a;
>       MPI_Bcast(&a, 1, MPI_INT, 0, MPI_COMM_WORLD);
>
>       printf("Hostname %s: Me %d broadcasted.\n", hostname, me);
>
>       MPI_Comm intraNodeComm;
>       int rc = MPI_Comm_split(MPI_COMM_WORLD, me, me, &intraNodeComm);
>
>       if (rc != MPI_SUCCESS)
>       {
>           printf("MAIN: Problems MPI_Comm_split...Exiting...\n");
>           return -1;
>       }
>
>       printf("Hostname %s: Me %d after comm split.\n", hostname, me);
>
>       MPI_Comm_free(&intraNodeComm);
>       MPI_Finalize();
>
>       return 0;
>   }
>
> Application output
>
>   Hostname server5: Me is 0.
>   Hostname server5: Me is 1.
>   Hostname server5: Me 1 broadcasted.
>   Hostname server5: Me 0 broadcasted.
>   [server5][[50702,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect] connect() to 172.31.1.254 failed: Connection refused (111)
>   [server5][[50702,1],1][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect] connect() to 172.31.1.254 failed: Connection refused (111)
>   Hostname server2: Me is 2.
>   Hostname server2: Me 2 broadcasted.
>   Hostname server2: Me is 3.
>   Hostname server2: Me 3 broadcasted.
>
> server2 ifconfig
>
>   eth0  Link encap:Ethernet...
>         UP BROADCAST MULTICAST MTU:1500 Metric:1
>         ...
>   eth1  Link encap:Ethernet...
>         UP BROADCAST MULTICAST MTU:1500 Metric:1
>         ...
>   eth2  Link encap:Ethernet...
>         UP BROADCAST MULTICAST MTU:1500 Metric:1
>         ...
>   eth3  Link encap:Ethernet...
>         inet addr:172.17.27.17 Bcast:172.17.27.255 Mask:255.255.255.0
>         inet6 addr: fe80::921b:eff:fe42:a5ba/64 Scope:Link
>         UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>         ...
>   lo    Link encap:Local Loopback
>         inet addr:127.0.0.1 Mask:255.0.0.0
>         UP LOOPBACK RUNNING MTU:65536 Metric:1
>         ...
>   mic0  Link encap:Ethernet
>         inet addr:172.31.1.254 Bcast:172.31.1.255 Mask:255.255.255.0
>         ...
>   mic1  Link encap:Ethernet...
>         inet addr:172.31.2.254 Bcast:172.31.2.255 Mask:255.255.255.0
>         ...
>
> server5 ifconfig
>
>   eth0  Link encap:Ethernet...
>         UP BROADCAST MULTICAST MTU:1500 Metric:1
>         ...
>   eth1  Link encap:Ethernet...
>         UP BROADCAST MULTICAST MTU:1500 Metric:1
>         ...
>   eth2  Link encap:Ethernet...
>         UP BROADCAST MULTICAST MTU:1500 Metric:1
>         ...
>   eth3  Link encap:Ethernet...
>         inet addr:172.17.27.14 Bcast:172.17.27.255 Mask:255.255.255.0
>         UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>         ...
>   lo    Link encap:Local Loopback
>         inet addr:127.0.0.1 Mask:255.0.0.0
>         ...
>   mic0  Link encap:Ethernet...
>         inet addr:172.31.1.254 Bcast:172.31.1.255 Mask:255.255.255.0
>         UP BROADCAST RUNNING MTU:64512 Metric:1
>         ...
>   mic1  Link encap:Ethernet...
>         inet addr:172.31.2.254 Bcast:172.31.2.255 Mask:255.255.255.0
>         ...
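As an aside on the verbose-output question quoted above: the TCP BTL's connection setup can be traced through the btl_base_verbose MCA parameter. A minimal sketch (the level 30 is only an illustrative value; the exclude list is the one already suggested in the thread):

  shell$ mpirun --mca btl_base_verbose 30 --mca btl_tcp_if_exclude mic0,mic1 -app appfile

The resulting output shows which interfaces and peer addresses each process attempts to use, which makes it easier to spot connections being attempted against the wrong (mic) addresses.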