If any communication is between two mics on the same node, or between a
mic and its host, I suggest using the scif btl instead of tcp. You will
see a factor of 10 or more improvement in latency by using the scif
interface.
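
For example, with the appfile launch used in this thread, the scif btl can
be selected alongside tcp with something like the following (a sketch; tcp
stays in the list to handle the host-to-host traffic):

  shell$ mpirun --mca btl scif,tcp,sm,self -app appfile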

-Nathan

On Tue, May 05, 2015 at 10:39:47AM +0530, Manumachu Reddy wrote:
>    Hi George,
> 
>    Sorry for the delay in writing to you.
> 
>    Your latest suggestion has worked admirably well.
> 
>    Thanks a lot for your help.
>    On Sun, Apr 26, 2015 at 9:32 PM, George Bosilca <bosi...@icl.utk.edu>
>    wrote:
> 
>      With the arguments I sent you, the "connection refused" error should
>      have disappeared. Let's try to force all traffic over the first TCP
>      interface, eth3. Try adding the following flags to your mpirun:
>      --mca pml ob1 --mca btl tcp,sm,self --mca btl_tcp_if_include eth3
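>      For example, with your appfile launch that would be something like:
>      mpirun --mca pml ob1 --mca btl tcp,sm,self --mca btl_tcp_if_include eth3 -app appfile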
>        George.
>      On Sun, Apr 26, 2015 at 8:04 AM, Manumachu Reddy
>      <manumachu.re...@gmail.com> wrote:
> 
>        Hi George,
>        I am afraid the suggestion to use btl_tcp_if_exclude has not helped.
>        I executed the following command:
>        shell$ mpirun --mca btl_tcp_if_exclude mic0,mic1 -app appfile
>        <the same output and hang>
>        Please let me know if there are options to mpirun (apart from -v) to
>        get verbose output to understand what is happening.
>        On Fri, Apr 24, 2015 at 5:59 PM, George Bosilca <bosi...@icl.utk.edu>
>        wrote:
> 
>          Manumachu,
>          Both nodes have the same IPs for their Phis (mic0 and mic1). This is
>          OK as long as they don't try to connect to each other using these
>          addresses. A simple fix is to prevent OMPI from using the supposedly
>          local mic0 and mic1 IPs. Add --mca btl_tcp_if_exclude mic0,mic1 to
>          your mpirun command and things should start working better.
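>          (Note that setting btl_tcp_if_exclude replaces the default exclude
>          list, so it is safest to also keep the loopback excluded, e.g.
>          --mca btl_tcp_if_exclude lo,mic0,mic1.)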
>          George.
> 
>          On Apr 24, 2015, at 03:32, Manumachu Reddy
>          <manumachu.re...@gmail.com> wrote:
> 
>            Dear OpenMPI Users,
> 
>            I request your help to resolve a hang in my OpenMPI application.
> 
>            My OpenMPI application hangs in the MPI_Comm_split() operation. The
>            code for this simple application is at the end of this email.
>            Broadcast works fine.
> 
>            My experimental setup comprises two RHEL6.4 Linux nodes. Each
>            node has 2 mic cards. Please note that although the mic cards are
>            present, my OpenMPI application does not use them.
> 
>            I have tested with two OpenMPI versions (1.6.5 and 1.8.4). I see
>            the hang in both versions. OpenMPI is installed using the
>            following command:
> 
>            ./configure
>            --prefix=/home/manumachu/OpenMPI/openmpi-1.8.4/OPENMPI_INSTALL_ICC
>            CC="icc -fPIC" CXX="icpc -fPIC"
> 
>            I have made sure the firewall is turned off, using the
>            following commands:
> 
>            sudo service iptables save
>            sudo service iptables stop
>            sudo chkconfig iptables off
> 
>            I made sure the mic cards are online and healthy. I am able to
>            log in to the mic cards.
> 
>            I use an appfile to launch 2 processes on each node.
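>            It contains one line of mpirun arguments per node, something like
>            the following ("./myapp" standing in for the actual binary):
>            -np 2 -host server5 ./myapp
>            -np 2 -host server2 ./myapp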
> 
>            I have also attached the "ifconfig" list for each node. Could this
>            problem be related to the multiple network interfaces (see the
>            application output at the end of this email)?
> 
>            Please let me know if you need further information and I look
>            forward to your suggestions.
> 
>            Best Regards
>            Manumachu
> 
>            Application
> 
>            #include <stdio.h>
>            #include <mpi.h>
> 
>            int main(int argc, char** argv)
>            {
>                int me, hostnamelen;
>                char hostname[MPI_MAX_PROCESSOR_NAME];
> 
>                MPI_Init(&argc, &argv);
> 
>                MPI_Get_processor_name(hostname, &hostnamelen);
> 
>                MPI_Comm_rank(MPI_COMM_WORLD, &me);
>                printf("Hostname %s: Me is %d.\n", hostname, me);
> 
>                int a = 0;
>                MPI_Bcast(&a, 1, MPI_INT, 0, MPI_COMM_WORLD);
> 
>                printf("Hostname %s: Me %d broadcasted.\n", hostname, me);
> 
>                MPI_Comm intraNodeComm;
>                int rc = MPI_Comm_split(
>                            MPI_COMM_WORLD,
>                            me,
>                            me,
>                            &intraNodeComm
>                );
> 
>                if (rc != MPI_SUCCESS)
>                {
>                   printf("MAIN: Problems MPI_Comm_split...Exiting...\n");
>                   return -1;
>                }
> 
>                printf("Hostname %s: Me %d after comm split.\n", hostname, me);
>                MPI_Comm_free(&intraNodeComm);
>                MPI_Finalize();
> 
>                return 0;
>            }
> 
>            Application output
> 
>            Hostname server5: Me is 0.
>            Hostname server5: Me is 1.
>            Hostname server5: Me 1 broadcasted.
>            Hostname server5: Me 0 broadcasted.
>            [server5][[50702,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
>            connect() to 172.31.1.254 failed: Connection refused (111)
>            [server5][[50702,1],1][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
>            connect() to 172.31.1.254 failed: Connection refused (111)
>            Hostname server2: Me is 2.
>            Hostname server2: Me 2 broadcasted.
>            Hostname server2: Me is 3.
>            Hostname server2: Me 3 broadcasted.
> 
>            server2 ifconfig
> 
>            eth0      Link encap:Ethernet...
>                      UP BROADCAST MULTICAST  MTU:1500  Metric:1
>                      ...
>            eth1      Link encap:Ethernet...
>                      UP BROADCAST MULTICAST  MTU:1500  Metric:1
>                      ...
>            eth2      Link encap:Ethernet...
>                      UP BROADCAST MULTICAST  MTU:1500  Metric:1
>                      ...
>            eth3      Link encap:Ethernet...
>                      inet addr:172.17.27.17  Bcast:172.17.27.255  Mask:255.255.255.0
>                      inet6 addr: fe80::921b:eff:fe42:a5ba/64 Scope:Link
>                      UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>                      ...
>            lo        Link encap:Local Loopback 
>                      inet addr:127.0.0.1  Mask:255.0.0.0
>                      UP LOOPBACK RUNNING  MTU:65536  Metric:1
>                      ...
> 
>            mic0      Link encap:Ethernet
>                      inet addr:172.31.1.254  Bcast:172.31.1.255  Mask:255.255.255.0
>                      ...
> 
>            mic1      Link encap:Ethernet...
>                      inet addr:172.31.2.254  Bcast:172.31.2.255  Mask:255.255.255.0
>                      ...
> 
>            server5 ifconfig
>            eth0      Link encap:Ethernet...
>                      UP BROADCAST MULTICAST  MTU:1500  Metric:1
>                      ...
> 
>            eth1      Link encap:Ethernet...
>                      UP BROADCAST MULTICAST  MTU:1500  Metric:1
>                      ...
> 
>            eth2      Link encap:Ethernet...
>                      UP BROADCAST MULTICAST  MTU:1500  Metric:1
>                      ...
> 
>            eth3      Link encap:Ethernet...
>                      inet addr:172.17.27.14  Bcast:172.17.27.255  Mask:255.255.255.0
>                      UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>                      ...
> 
>            lo        Link encap:Local Loopback 
>                      inet addr:127.0.0.1  Mask:255.0.0.0
>                      ...
> 
>            mic0      Link encap:Ethernet...
>                      inet addr:172.31.1.254  Bcast:172.31.1.255  Mask:255.255.255.0
>                      UP BROADCAST RUNNING  MTU:64512  Metric:1
>                      ...
> 
>            mic1      Link encap:Ethernet...
>                      inet addr:172.31.2.254  Bcast:172.31.2.255  Mask:255.255.255.0
>                      ...
> 
>        --
>        Best Regards
>        Ravi
> 
>    --
>    Best Regards
>    Ravi
