Hi George,

I am afraid the suggestion to use btl_tcp_if_exclude did not help. I
executed the following command:

*shell$ mpirun --mca btl_tcp_if_exclude mic0,mic1 -app appfile*
<the same output and hang>

Please let me know if there are mpirun options (apart from -v) that produce
more verbose output, so I can understand what is happening.
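
Would enabling BTL-level verbosity be the right way to debug this? This is a
guess on my part, assuming btl_base_verbose is the relevant MCA parameter in
1.8.4:

*shell$ mpirun --mca btl_tcp_if_exclude mic0,mic1 --mca btl_base_verbose 100
-app appfile*

I have not tried it yet, so please correct me if there is a better option.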


On Fri, Apr 24, 2015 at 5:59 PM, George Bosilca <bosi...@icl.utk.edu> wrote:

> Manumachu,
>
> Both nodes have the same IP for their Phi (mic0 and mic1). This is OK as
> long as they don't try to connect to each other using these addresses. A
> simple fix is to prevent OMPI from using the supposedly local mic0 and mic1
> IP. Add --mca btl_tcp_if_exclude mic0,mic1 to your mpirun command and
> things should start working better.
>
> George.
>
>
>
> On Apr 24, 2015, at 03:32, Manumachu Reddy <manumachu.re...@gmail.com>
> wrote:
>
>
> Dear OpenMPI Users,
>
> I request your help to resolve a hang in my OpenMPI application.
>
> My OpenMPI application hangs in the MPI_Comm_split() operation. The code for
> this simple application is at the end of this email. The broadcast works fine.
>
> My experimental setup consists of two RHEL6.4 Linux nodes. Each node has
> 2 mic cards. Please note that although the mic cards are present, my OpenMPI
> application does not use them.
>
> I have tested with two OpenMPI versions (1.6.5, 1.8.4). I see the hang in
> both versions. OpenMPI is installed using the following command:
>
> ./configure
> --prefix=/home/manumachu/OpenMPI/openmpi-1.8.4/OPENMPI_INSTALL_ICC CC="icc
> -fPIC" CXX="icpc -fPIC"
>
> I have made sure the firewall is turned off, using the following
> commands:
>
> sudo service iptables save
> sudo service iptables stop
> sudo chkconfig iptables off
>
> I have made sure the mic cards are online and healthy, and I am able to log
> in to them.
>
> I use an appfile to launch 2 processes on each node.
>
> I have also included the "ifconfig" output for each node at the end of this
> email. Judging from the application output (also shown at the end of the
> email), could this problem be related to the multiple network interfaces?
>
> Please let me know if you need further information; I look forward to your
> suggestions.
>
> Best Regards
> Manumachu
>
> *Application*
>
> #include <stdio.h>
> #include <mpi.h>
>
> int main(int argc, char** argv)
> {
>     int me, hostnamelen;
>     char hostname[MPI_MAX_PROCESSOR_NAME];
>
>     MPI_Init(&argc, &argv);
>
>     MPI_Get_processor_name(hostname, &hostnamelen);
>
>     MPI_Comm_rank(MPI_COMM_WORLD, &me);
>     printf("Hostname %s: Me is %d.\n", hostname, me);
>
>     int a;
>     MPI_Bcast(&a, 1, MPI_INT, 0, MPI_COMM_WORLD);
>
>     printf("Hostname %s: Me %d broadcasted.\n", hostname, me);
>
>     MPI_Comm intraNodeComm;
>     int rc = MPI_Comm_split(
>                 MPI_COMM_WORLD,
>                 me,
>                 me,
>                 &intraNodeComm
>     );
>
>     if (rc != MPI_SUCCESS)
>     {
>        printf("MAIN: Problems MPI_Comm_split...Exiting...\n");
>        return -1;
>     }
>
>     printf("Hostname %s: Me %d after comm split.\n", hostname, me);
>     MPI_Comm_free(&intraNodeComm);
>     MPI_Finalize();
>
>     return 0;
> }
>
> *Application output*
>
> Hostname server5: Me is 0.
> Hostname server5: Me is 1.
> Hostname server5: Me 1 broadcasted.
> Hostname server5: Me 0 broadcasted.
> [server5][[50702,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
> connect() to 172.31.1.254 failed: Connection refused (111)
> [server5][[50702,1],1][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
> connect() to 172.31.1.254 failed: Connection refused (111)
> Hostname server2: Me is 2.
> Hostname server2: Me 2 broadcasted.
> Hostname server2: Me is 3.
> Hostname server2: Me 3 broadcasted.
>
> *server2 ifconfig*
>
> eth0      Link encap:Ethernet...
>           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>           ...
> eth1      Link encap:Ethernet...
>           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>           ...
> eth2      Link encap:Ethernet...
>           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>           ...
> eth3      Link encap:Ethernet...
>           inet addr:172.17.27.17  Bcast:172.17.27.255  Mask:255.255.255.0
>           inet6 addr: fe80::921b:eff:fe42:a5ba/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           ...
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           UP LOOPBACK RUNNING  MTU:65536  Metric:1
>           ...
>
> mic0      Link encap:Ethernet
>           inet addr:172.31.1.254  Bcast:172.31.1.255  Mask:255.255.255.0
>           ...
>
> mic1      Link encap:Ethernet...
>           inet addr:172.31.2.254  Bcast:172.31.2.255  Mask:255.255.255.0
>           ...
>
> *server5 ifconfig*
> eth0      Link encap:Ethernet...
>           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>           ...
>
> eth1      Link encap:Ethernet...
>           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>           ...
>
> eth2      Link encap:Ethernet...
>           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>           ...
>
> eth3      Link encap:Ethernet...
>           inet addr:172.17.27.14  Bcast:172.17.27.255  Mask:255.255.255.0
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           ...
>
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           ...
>
> mic0      Link encap:Ethernet...
>           inet addr:172.31.1.254  Bcast:172.31.1.255  Mask:255.255.255.0
>           UP BROADCAST RUNNING  MTU:64512  Metric:1
>           ...
>
> mic1      Link encap:Ethernet...
>           inet addr:172.31.2.254  Bcast:172.31.2.255  Mask:255.255.255.0
>           ...
>



-- 
Best Regards
Ravi
