[OMPI users] Hang in MPI_Comm_split in 2 RHEL Linux nodes with INTEL MIC cards
Dear OpenMPI Users,

I request your help to resolve a hang in my OpenMPI application. The application hangs in the MPI_Comm_split() operation; the code for this simple application is at the end of this email. Broadcast works fine.

My experimental setup comprises two RHEL 6.4 Linux nodes. Each node has 2 mic cards. Please note that although the mic cards are present, I do not use them in my OpenMPI application.

I have tested with two OpenMPI versions (1.6.5 and 1.8.4) and see the hang in both. OpenMPI is installed using the following configure command:

./configure --prefix=/home/manumachu/OpenMPI/openmpi-1.8.4/OPENMPI_INSTALL_ICC CC="icc -fPIC" CXX="icpc -fPIC"

I have made sure the firewall is turned off using the following commands:

sudo service iptables save
sudo service iptables stop
sudo chkconfig iptables off

I have also made sure the mic cards are online and healthy; I am able to log in to them.

I use an appfile to launch 2 processes on each node.

The "ifconfig" listing for each node is attached below. Could this problem be related to the multiple network interfaces (see also the application output at the end of this email)?

Please let me know if you need further information. I look forward to your suggestions.

Best Regards
Manumachu

*Application*

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv)
{
  int me, hostnamelen;
  char hostname[MPI_MAX_PROCESSOR_NAME];

  MPI_Init(&argc, &argv);

  MPI_Get_processor_name(hostname, &hostnamelen);

  MPI_Comm_rank(MPI_COMM_WORLD, &me);
  printf("Hostname %s: Me is %d.\n", hostname, me);

  int a = 0;
  MPI_Bcast(&a, 1, MPI_INT, 0, MPI_COMM_WORLD);

  printf("Hostname %s: Me %d broadcasted.\n", hostname, me);

  /* Split with color == key == me: each rank ends up in its own
     single-member communicator. */
  MPI_Comm intraNodeComm;
  int rc = MPI_Comm_split(MPI_COMM_WORLD, me, me, &intraNodeComm);

  if (rc != MPI_SUCCESS)
  {
     printf("MAIN: Problems MPI_Comm_split...Exiting...\n");
     return -1;
  }

  printf("Hostname %s: Me %d after comm split.\n", hostname, me);
  MPI_Comm_free(&intraNodeComm);
  MPI_Finalize();

  return 0;
}

*Application output*

Hostname server5: Me is 0.
Hostname server5: Me is 1.
Hostname server5: Me 1 broadcasted.
Hostname server5: Me 0 broadcasted.
[server5][[50702,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect] connect() to 172.31.1.254 failed: Connection refused (111)
[server5][[50702,1],1][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect] connect() to 172.31.1.254 failed: Connection refused (111)
Hostname server2: Me is 2.
Hostname server2: Me 2 broadcasted.
Hostname server2: Me is 3.
Hostname server2: Me 3 broadcasted.

*server2 ifconfig*

eth0  Link encap:Ethernet...
      UP BROADCAST MULTICAST  MTU:1500  Metric:1
      ...
eth1  Link encap:Ethernet...
      UP BROADCAST MULTICAST  MTU:1500  Metric:1
      ...
eth2  Link encap:Ethernet...
      UP BROADCAST MULTICAST  MTU:1500  Metric:1
      ...
eth3  Link encap:Ethernet...
      inet addr:172.17.27.17  Bcast:172.17.27.255  Mask:255.255.255.0
      inet6 addr: fe80::921b:eff:fe42:a5ba/64 Scope:Link
      UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
      ...
lo    Link encap:Local Loopback
      inet addr:127.0.0.1  Mask:255.0.0.0
      UP LOOPBACK RUNNING  MTU:65536  Metric:1
      ...
mic0  Link encap:Ethernet
      inet addr:172.31.1.254  Bcast:172.31.1.255  Mask:255.255.255.0
      ...
mic1  Link encap:Ethernet...
      inet addr:172.31.2.254  Bcast:172.31.2.255  Mask:255.255.255.0
      ...

*server5 ifconfig*

eth0  Link encap:Ethernet...
      UP BROADCAST MULTICAST  MTU:1500  Metric:1
      ...
eth1  Link encap:Ethernet...
      UP BROADCAST MULTICAST  MTU:1500  Metric:1
      ...
eth2  Link encap:Ethernet...
      UP BROADCAST MULTICAST  MTU:1500  Metric:1
      ...
eth3  Link encap:Ethernet...
      inet addr:172.17.27.14  Bcast:172.17.27.255  Mask:255.255.255.0
      UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
      ...
lo    Link encap:Local Loopback
      inet addr:127.0.0.1  Mask:255.0.0.0
      ...
mic0  Link encap:Ethernet...
      inet addr:172.31.1.254  Bcast:172.31.1.255  Mask:255.255.255.0
      UP BROADCAST RUNNING  MTU:64512  Metric:1
      ...
mic1  Link encap:Ethernet...
      inet addr:172.31.2.254  Bcast:172.31.2.255  Mask:255.255.255.0
      ...
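The appfile itself is not included in the thread. A minimal sketch of what it might look like for this setup is below (one application context per line; the hostnames are taken from the output above, and the executable name ./comm_split_test is only a placeholder):

  -np 2 -host server5 ./comm_split_test
  -np 2 -host server2 ./comm_split_test

Each line is handed to mpirun as a separate application context, so this launches two processes on each node, matching the four ranks seen in the output.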
Re: [OMPI users] Hang in MPI_Comm_split in 2 RHEL Linux nodes with INTEL MIC cards
Hi George,

I am afraid the suggestion to use btl_tcp_if_exclude has not helped. I executed the following command:

*shell$ mpirun --mca btl_tcp_if_exclude mic0,mic1 -app appfile*

Please let me know if there are options to mpirun (apart from -v) that give verbose output, so I can understand what is happening.

On Fri, Apr 24, 2015 at 5:59 PM, George Bosilca wrote:

> Manumachu,
>
> Both nodes have the same IP for their Phi (mic0 and mic1). This is OK as
> long as they don't try to connect to each other using these addresses. A
> simple fix is to prevent OMPI from using the supposedly local mic0 and mic1
> IP. Add --mca btl_tcp_if_exclude mic0,mic1 to your mpirun command and
> things should start working better.
>
> George.
>
> On Apr 24, 2015, at 03:32, Manumachu Reddy wrote:
>
> [original message quoted in full; snipped -- see the first message above]
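One way to get the verbose output asked about above is to raise the verbosity of the relevant Open MPI frameworks through MCA parameters rather than -v. A sketch, assuming the same appfile launch (the verbosity level 30 is a reasonable choice, not something taken from the thread):

  mpirun --mca btl_base_verbose 30 --mca btl_tcp_if_exclude lo,mic0,mic1 -app appfile

Note also that setting btl_tcp_if_exclude replaces its default value, which normally excludes the loopback interface, so lo (or 127.0.0.1/8) generally needs to be listed explicitly alongside mic0 and mic1.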
Re: [OMPI users] Hang in MPI_Comm_split in 2 RHEL Linux nodes with INTEL MIC cards
Hi George,

Sorry for the delay in writing to you. Your latest suggestion has worked admirably well. Thanks a lot for your help.

On Sun, Apr 26, 2015 at 9:32 PM, George Bosilca wrote:

> With the arguments I sent you, the error about connection refused should
> have disappeared. Let's try to force all traffic over the first TCP
> interface, eth3. Try adding the following flags to your mpirun:
>
> --mca pml ob1 --mca btl tcp,sm,self --mca btl_tcp_if_include eth3
>
> George.
>
> On Sun, Apr 26, 2015 at 8:04 AM, Manumachu Reddy <manumachu.re...@gmail.com> wrote:
>
>> [earlier messages quoted in full; snipped -- see above]
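For completeness, the command corresponding to George's final suggestion would look roughly like this (a sketch, assuming the same appfile as in the earlier messages):

  mpirun --mca pml ob1 --mca btl tcp,sm,self --mca btl_tcp_if_include eth3 -app appfile

Restricting the TCP BTL to eth3, the only interface whose address is reachable from the other node, sidesteps the identical mic0/mic1 addresses entirely, while the sm and self BTLs continue to carry intra-node and self traffic.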