Manumachu,

Both nodes have the same IP addresses on their Xeon Phi interfaces (mic0 and 
mic1). This is fine as long as the processes never try to connect to each other 
using those addresses. A simple fix is to prevent Open MPI from using the mic0 
and mic1 interfaces: add --mca btl_tcp_if_exclude mic0,mic1 to your mpirun 
command and things should start working better.
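
For example, something along these lines should do it (the bracketed part 
stands in for your usual appfile/launch arguments):

    mpirun --mca btl_tcp_if_exclude mic0,mic1 [rest of your usual mpirun arguments]

The same parameter can also be set in the environment as 
OMPI_MCA_btl_tcp_if_exclude=mic0,mic1 if that is easier with your launch 
scripts. Since overriding btl_tcp_if_exclude replaces its default value, which 
normally excludes the loopback interface, you may want to keep lo in the 
exclude list as well.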

George.



> On Apr 24, 2015, at 03:32, Manumachu Reddy <manumachu.re...@gmail.com> wrote:
> 
> 
> Dear OpenMPI Users,
> 
> I request your help to resolve a hang in my OpenMPI application.
> 
> My OpenMPI application hangs in the MPI_Comm_split() operation. The code for this 
> simple application is at the end of this email. The broadcast works fine.
> 
> My experimental setup comprises two RHEL 6.4 Linux nodes. Each node has two 
> MIC cards. Please note that although the MIC cards are present, my OpenMPI 
> application does not use them.
> 
> I have tested with two OpenMPI versions (1.6.5 and 1.8.4) and I see the hang 
> in both. OpenMPI is installed using the following command:
> 
> ./configure 
> --prefix=/home/manumachu/OpenMPI/openmpi-1.8.4/OPENMPI_INSTALL_ICC CC="icc 
> -fPIC" CXX="icpc -fPIC"
> 
> I have made sure I have turned off the firewall using the following commands:
> 
> sudo service iptables save
> sudo service iptables stop
> sudo chkconfig iptables off
> 
> I made sure the MIC cards are online and healthy, and I am able to log in to 
> them.
> 
> I use an appfile to launch 2 processes on each node.
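> 
> For illustration, the appfile has one app-context line per node, roughly like 
> the following (the binary name is just a placeholder):
> 
>     -np 2 --host server5 ./my_app
>     -np 2 --host server2 ./my_app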
> 
> I have also attached the "ifconfig" output for each node. Could this problem be 
> related to the multiple network interfaces (see the application output at the 
> end of this email)?
> 
> Please let me know if you need further information, and I look forward to your 
> suggestions.
> 
> Best Regards
> Manumachu
> 
> Application
> 
> #include <stdio.h>
> #include <mpi.h>
> 
> int main(int argc, char** argv)
> {
>     int me, hostnamelen;
>     char hostname[MPI_MAX_PROCESSOR_NAME];
> 
>     MPI_Init(&argc, &argv);
> 
>     MPI_Get_processor_name(hostname, &hostnamelen);
> 
>     MPI_Comm_rank(MPI_COMM_WORLD, &me);
>     printf("Hostname %s: Me is %d.\n", hostname, me);
> 
>     int a;
>     MPI_Bcast(&a, 1, MPI_INT, 0, MPI_COMM_WORLD);
> 
>     printf("Hostname %s: Me %d broadcasted.\n", hostname, me);
> 
>     MPI_Comm intraNodeComm;
>     int rc = MPI_Comm_split(
>                 MPI_COMM_WORLD,
>                 me,
>                 me,
>                 &intraNodeComm
>     );
> 
>     if (rc != MPI_SUCCESS)
>     {
>        printf("MAIN: Problems MPI_Comm_split...Exiting...\n");
>        return -1;
>     }
> 
>     printf("Hostname %s: Me %d after comm split.\n", hostname, me);
>     MPI_Comm_free(&intraNodeComm);
>     MPI_Finalize();
> 
>     return 0;
> }
> 
> Application output
> 
> Hostname server5: Me is 0.
> Hostname server5: Me is 1.
> Hostname server5: Me 1 broadcasted.
> Hostname server5: Me 0 broadcasted.
> [server5][[50702,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
>  connect() to 172.31.1.254 failed: Connection refused (111)
> [server5][[50702,1],1][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
>  connect() to 172.31.1.254 failed: Connection refused (111)
> Hostname server2: Me is 2.
> Hostname server2: Me 2 broadcasted.
> Hostname server2: Me is 3.
> Hostname server2: Me 3 broadcasted.
> 
> server2 ifconfig
> 
> eth0      Link encap:Ethernet...
>           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>           ...
> eth1      Link encap:Ethernet...
>           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>           ...
> eth2      Link encap:Ethernet...
>           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>           ...
> eth3      Link encap:Ethernet...
>           inet addr:172.17.27.17  Bcast:172.17.27.255  Mask:255.255.255.0
>           inet6 addr: fe80::921b:eff:fe42:a5ba/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           ...
> lo        Link encap:Local Loopback  
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           UP LOOPBACK RUNNING  MTU:65536  Metric:1
>           ...
> 
> mic0      Link encap:Ethernet
>           inet addr:172.31.1.254  Bcast:172.31.1.255  Mask:255.255.255.0
>           ...
> 
> mic1      Link encap:Ethernet...
>           inet addr:172.31.2.254  Bcast:172.31.2.255  Mask:255.255.255.0
>           ...
> 
> server5 ifconfig
> eth0      Link encap:Ethernet...
>           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>           ...
> 
> eth1      Link encap:Ethernet...
>           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>           ...
> 
> eth2      Link encap:Ethernet...
>           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>           ...
> 
> eth3      Link encap:Ethernet...
>           inet addr:172.17.27.14  Bcast:172.17.27.255  Mask:255.255.255.0
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           ...
> 
> lo        Link encap:Local Loopback  
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           ...
> 
> mic0      Link encap:Ethernet...
>           inet addr:172.31.1.254  Bcast:172.31.1.255  Mask:255.255.255.0
>           UP BROADCAST RUNNING  MTU:64512  Metric:1
>           ...
> 
> mic1      Link encap:Ethernet...
>           inet addr:172.31.2.254  Bcast:172.31.2.255  Mask:255.255.255.0
>           ...
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/04/26780.php
