Hi George,

Sorry for the delay in writing to you.

Your latest suggestion has worked admirably.

Thanks a lot for your help.
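
For the archives, the command that resolved the hang was along these lines
(George's flags combined with the appfile discussed below in the thread):

shell$ mpirun --mca pml ob1 --mca btl tcp,sm,self --mca btl_tcp_if_include eth3 -app appfile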


On Sun, Apr 26, 2015 at 9:32 PM, George Bosilca <bosi...@icl.utk.edu> wrote:

> With the arguments I sent you, the "connection refused" error should have
> disappeared. Let's try forcing all traffic over the eth3 TCP interface.
> Try adding the following flags to your mpirun command:
>
> --mca pml ob1 --mca btl tcp,sm,self --mca btl_tcp_if_include eth3
>
>   George.
>
>
> On Sun, Apr 26, 2015 at 8:04 AM, Manumachu Reddy <
> manumachu.re...@gmail.com> wrote:
>
>>
>> Hi George,
>>
>> I am afraid the suggestion to use btl_tcp_if_exclude has not helped. I
>> executed the following command:
>>
>> *shell$ mpirun --mca btl_tcp_if_exclude mic0,mic1 -app appfile*
>> <the same output and hang>
>>
>> Please let me know if there are options to mpirun (apart from -v) to get
>> verbose output to understand what is happening.
>>
>>
>> On Fri, Apr 24, 2015 at 5:59 PM, George Bosilca <bosi...@icl.utk.edu>
>> wrote:
>>
>>> Manumachu,
>>>
>>> Both nodes have the same IPs for their Phi interfaces (mic0 and mic1). This
>>> is OK as long as they don't try to connect to each other using these
>>> addresses. A simple fix is to prevent OMPI from using the supposedly local
>>> mic0 and mic1 IPs. Add --mca btl_tcp_if_exclude mic0,mic1 to your mpirun
>>> command and things should start working better.
>>>
>>> George.
>>>
>>>
>>>
>>> On Apr 24, 2015, at 03:32, Manumachu Reddy <manumachu.re...@gmail.com>
>>> wrote:
>>>
>>>
>>> Dear OpenMPI Users,
>>>
>>> I request your help to resolve a hang in my OpenMPI application.
>>>
>>> My OpenMPI application hangs in MPI_Comm_split() operation. The code for
>>> this simple application is at the end of this email. Broadcast works fine.
>>>
>>> My experimental setup consists of two RHEL6.4 Linux nodes, each with 2 mic
>>> cards. Please note that although the mic cards are present, my OpenMPI
>>> application does not use them.
>>>
>>> I have tested two OpenMPI versions (1.6.5 and 1.8.4) and see the hang in
>>> both. OpenMPI is installed using the following command:
>>>
>>> ./configure --prefix=/home/manumachu/OpenMPI/openmpi-1.8.4/OPENMPI_INSTALL_ICC \
>>>     CC="icc -fPIC" CXX="icpc -fPIC"
>>>
>>> I have made sure I have turned off the firewall using the following
>>> commands:
>>>
>>> sudo service iptables save
>>> sudo service iptables stop
>>> sudo chkconfig iptables off
>>>
>>> I made sure the mic cards are online and healthy. I am able to login to
>>> the mic cards.
>>>
>>> I use an appfile to launch 2 processes on each node.
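>>>
>>> For reference, the appfile is roughly of this form (the hostnames are my two
>>> nodes; the binary path is a placeholder):
>>>
>>> -host server5 -np 2 ./comm_split_test
>>> -host server2 -np 2 ./comm_split_test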
>>>
>>> I have also included the "ifconfig" output for each node at the end of this
>>> email. Could this problem be related to the multiple network interfaces (see
>>> the connection errors in the application output below)?
>>>
>>> Please let me know if you need further information, and I look forward to
>>> your suggestions.
>>>
>>> Best Regards
>>> Manumachu
>>>
>>> *Application*
>>>
>>> #include <stdio.h>
>>> #include <mpi.h>
>>>
>>> int main(int argc, char** argv)
>>> {
>>>     int me, hostnamelen;
>>>     char hostname[MPI_MAX_PROCESSOR_NAME];
>>>
>>>     MPI_Init(&argc, &argv);
>>>
>>>     MPI_Get_processor_name(hostname, &hostnamelen);
>>>
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &me);
>>>     printf("Hostname %s: Me is %d.\n", hostname, me);
>>>
>>>     int a;
>>>     MPI_Bcast(&a, 1, MPI_INT, 0, MPI_COMM_WORLD);
>>>
>>>     printf("Hostname %s: Me %d broadcasted.\n", hostname, me);
>>>
>>>     MPI_Comm intraNodeComm;
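>>>     /* Split MPI_COMM_WORLD using the rank as both color and key; with a
>>>        unique color per rank, each process ends up alone in its new
>>>        communicator. */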
>>>     int rc = MPI_Comm_split(
>>>                 MPI_COMM_WORLD,
>>>                 me,
>>>                 me,
>>>                 &intraNodeComm
>>>     );
>>>
>>>     if (rc != MPI_SUCCESS)
>>>     {
>>>        printf("MAIN: Problems MPI_Comm_split...Exiting...\n");
>>>        return -1;
>>>     }
>>>
>>>     printf("Hostname %s: Me %d after comm split.\n", hostname, me);
>>>     MPI_Comm_free(&intraNodeComm);
>>>     MPI_Finalize();
>>>
>>>     return 0;
>>> }
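>>>
>>> For reference, I build and launch it roughly as follows (the binary name is
>>> only illustrative; the launch uses the appfile mentioned above):
>>>
>>> mpicc -o comm_split_test comm_split_test.c
>>> mpirun -app appfile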
>>>
>>> *Application output*
>>>
>>> Hostname server5: Me is 0.
>>> Hostname server5: Me is 1.
>>> Hostname server5: Me 1 broadcasted.
>>> Hostname server5: Me 0 broadcasted.
>>> [server5][[50702,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
>>> connect() to 172.31.1.254 failed: Connection refused (111)
>>> [server5][[50702,1],1][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
>>> connect() to 172.31.1.254 failed: Connection refused (111)
>>> Hostname server2: Me is 2.
>>> Hostname server2: Me 2 broadcasted.
>>> Hostname server2: Me is 3.
>>> Hostname server2: Me 3 broadcasted.
>>>
>>> *server2 ifconfig*
>>>
>>> eth0      Link encap:Ethernet...
>>>           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>>>           ...
>>> eth1      Link encap:Ethernet...
>>>           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>>>           ...
>>> eth2      Link encap:Ethernet...
>>>           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>>>           ...
>>> eth3      Link encap:Ethernet...
>>>           inet addr:172.17.27.17  Bcast:172.17.27.255  Mask:255.255.255.0
>>>           inet6 addr: fe80::921b:eff:fe42:a5ba/64 Scope:Link
>>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>           ...
>>> lo        Link encap:Local Loopback
>>>           inet addr:127.0.0.1  Mask:255.0.0.0
>>>           UP LOOPBACK RUNNING  MTU:65536  Metric:1
>>>           ...
>>>
>>> mic0      Link encap:Ethernet
>>>           inet addr:172.31.1.254  Bcast:172.31.1.255  Mask:255.255.255.0
>>>           ...
>>>
>>> mic1      Link encap:Ethernet...
>>>           inet addr:172.31.2.254  Bcast:172.31.2.255  Mask:255.255.255.0
>>>           ...
>>>
>>> *server5 ifconfig*
>>> eth0      Link encap:Ethernet...
>>>           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>>>           ...
>>>
>>> eth1      Link encap:Ethernet...
>>>           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>>>           ...
>>>
>>> eth2      Link encap:Ethernet...
>>>           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>>>           ...
>>>
>>> eth3      Link encap:Ethernet...
>>>           inet addr:172.17.27.14  Bcast:172.17.27.255  Mask:255.255.255.0
>>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>           ...
>>>
>>> lo        Link encap:Local Loopback
>>>           inet addr:127.0.0.1  Mask:255.0.0.0
>>>           ...
>>>
>>> mic0      Link encap:Ethernet...
>>>           inet addr:172.31.1.254  Bcast:172.31.1.255  Mask:255.255.255.0
>>>           UP BROADCAST RUNNING  MTU:64512  Metric:1
>>>           ...
>>>
>>> mic1      Link encap:Ethernet...
>>>           inet addr:172.31.2.254  Bcast:172.31.2.255  Mask:255.255.255.0
>>>           ...
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2015/04/26780.php
>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2015/04/26784.php
>>>
>>
>>
>>
>> --
>> Best Regards
>> Ravi
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2015/04/26788.php
>>
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/04/26791.php
>



-- 
Best Regards
Ravi
