Dear OpenMPI Users,

I request your help to resolve a hang in my OpenMPI application.

My OpenMPI application hangs in the MPI_Comm_split() operation. The code for
this simple application is at the end of this email. The broadcast that
precedes the split works fine.

My experimental setup consists of two RHEL 6.4 Linux nodes. Each node has two
mic cards. Please note that although the mic cards are present, my OpenMPI
application does not use them.

I have tested with two OpenMPI versions (1.6.5 and 1.8.4) and I see the hang
in both. OpenMPI was installed using the following command:

./configure --prefix=/home/manumachu/OpenMPI/openmpi-1.8.4/OPENMPI_INSTALL_ICC \
    CC="icc -fPIC" CXX="icpc -fPIC"

I have turned off the firewall on both nodes using the following commands:

sudo service iptables save
sudo service iptables stop
sudo chkconfig iptables off

I made sure the mic cards are online and healthy. I am able to log in to the
mic cards.

I use an appfile to launch 2 processes on each node.

I have also included the "ifconfig" output for each node at the end of this
email. Could this problem be related to the multiple network interfaces,
given the connection errors in the application output (also shown at the end
of the email)?

Please let me know if you need further information. I look forward to your
suggestions.

Best Regards
Manumachu

*Application*

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    int me, hostnamelen;
    char hostname[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);

    MPI_Get_processor_name(hostname, &hostnamelen);

    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    printf("Hostname %s: Me is %d.\n", hostname, me);

    /* Initialize the buffer so that the root broadcasts a defined value. */
    int a = 0;
    MPI_Bcast(&a, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("Hostname %s: Me %d broadcasted.\n", hostname, me);

    MPI_Comm intraNodeComm;
    int rc = MPI_Comm_split(
                MPI_COMM_WORLD,
                me,
                me,
                &intraNodeComm
    );

    if (rc != MPI_SUCCESS)
    {
       printf("MAIN: Problems MPI_Comm_split...Exiting...\n");
       /* Abort the job instead of returning without MPI_Finalize(). */
       MPI_Abort(MPI_COMM_WORLD, 1);
    }

    printf("Hostname %s: Me %d after comm split.\n", hostname, me);
    MPI_Comm_free(&intraNodeComm);
    MPI_Finalize();

    return 0;
}
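
A note on the split itself: the call above passes the rank as the color, so
every rank ends up in its own single-process communicator. If the intent
behind the name intraNodeComm is one communicator per node, the color would
have to be the same for all ranks on a host. The following is only a minimal
sketch of one way to derive such a color from the hostname. The helper name
color_from_hostname is hypothetical (it is not part of MPI), and two different
hostnames could in principle hash to the same value.

```c
/* Hypothetical helper, not an MPI function: map a hostname to a
 * non-negative integer that could be passed as the color argument of
 * MPI_Comm_split(). Uses the djb2 string hash; collisions between
 * different hostnames are possible, so this is illustrative only. */
static int color_from_hostname(const char *hostname)
{
    unsigned long hash = 5381UL;
    const unsigned char *p;

    for (p = (const unsigned char *)hostname; *p != '\0'; ++p)
        hash = hash * 33UL + *p;

    /* Keep the low 31 bits so the result is never negative. */
    return (int)(hash & 0x7fffffffUL);
}
```

With this helper, the split in the program above would pass
color_from_hostname(hostname) as the color and keep me as the key.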

*Application output*

Hostname server5: Me is 0.
Hostname server5: Me is 1.
Hostname server5: Me 1 broadcasted.
Hostname server5: Me 0 broadcasted.
[server5][[50702,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
connect() to 172.31.1.254 failed: Connection refused (111)
[server5][[50702,1],1][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
connect() to 172.31.1.254 failed: Connection refused (111)
Hostname server2: Me is 2.
Hostname server2: Me 2 broadcasted.
Hostname server2: Me is 3.
Hostname server2: Me 3 broadcasted.

*server2 ifconfig*

eth0      Link encap:Ethernet...
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          ...
eth1      Link encap:Ethernet...
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          ...
eth2      Link encap:Ethernet...
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          ...
eth3      Link encap:Ethernet...
          inet addr:172.17.27.17  Bcast:172.17.27.255  Mask:255.255.255.0
          inet6 addr: fe80::921b:eff:fe42:a5ba/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          ...
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          ...

mic0      Link encap:Ethernet
          inet addr:172.31.1.254  Bcast:172.31.1.255  Mask:255.255.255.0
          ...

mic1      Link encap:Ethernet...
          inet addr:172.31.2.254  Bcast:172.31.2.255  Mask:255.255.255.0
          ...

*server5 ifconfig*
eth0      Link encap:Ethernet...
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          ...

eth1      Link encap:Ethernet...
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          ...

eth2      Link encap:Ethernet...
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          ...

eth3      Link encap:Ethernet...
          inet addr:172.17.27.14  Bcast:172.17.27.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          ...

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          ...

mic0      Link encap:Ethernet...
          inet addr:172.31.1.254  Bcast:172.31.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING  MTU:64512  Metric:1
          ...

mic1      Link encap:Ethernet...
          inet addr:172.31.2.254  Bcast:172.31.2.255  Mask:255.255.255.0
          ...
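
Looking at the two interface lists, mic0 carries the same address
(172.31.1.254) on both nodes, and that is the address the failed connect()
calls in the application output point to, while the eth3 addresses
(172.17.27.14 and 172.17.27.17) differ and share a 255.255.255.0 netmask. As
a way to check such address pairs mechanically, here is a minimal sketch. The
helper name same_subnet is hypothetical, and the sketch says nothing about
how OpenMPI itself chooses interfaces.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper: returns 1 if the two dotted-quad IPv4 addresses
 * fall in the same subnet under the given netmask, 0 if they do not,
 * and -1 if any of the three strings cannot be parsed. */
static int same_subnet(const char *ip_a, const char *ip_b, const char *netmask)
{
    struct in_addr a, b, m;

    if (inet_pton(AF_INET, ip_a, &a) != 1 ||
        inet_pton(AF_INET, ip_b, &b) != 1 ||
        inet_pton(AF_INET, netmask, &m) != 1)
        return -1;

    /* All three values are in network byte order, so the masks line up. */
    return (a.s_addr & m.s_addr) == (b.s_addr & m.s_addr);
}
```

For the addresses listed above, the two eth3 addresses compare as being in
the same subnet, whereas an eth3 address and a mic0 address do not.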
