With the arguments I sent you, the "connection refused" error should
have disappeared. Let's try to force all traffic over the first TCP
interface, eth3. Try adding the following flags to your mpirun command:

--mca pml ob1 --mca btl tcp,sm,self --mca btl_tcp_if_include eth3
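
A sketch of the full command line, assuming the same appfile as in your
earlier run. The btl_base_verbose setting is an addition to the flags
above, and the level 30 is an arbitrary choice; it should make the BTL
layer print more detail about the connections it attempts. The snippet
only assembles and prints the command so it can be checked before
launching:

```shell
# Assumptions: interface name and appfile match the earlier messages;
# the verbosity level is an arbitrary illustrative value.
IFACE=eth3
APPFILE=appfile
VERBOSE=30

# Build the MCA flags step by step so each one is easy to toggle.
MCA_FLAGS="--mca pml ob1 --mca btl tcp,sm,self"
MCA_FLAGS="$MCA_FLAGS --mca btl_tcp_if_include $IFACE"
MCA_FLAGS="$MCA_FLAGS --mca btl_base_verbose $VERBOSE"

# Dry run: print the command instead of executing it.
MPIRUN_CMD="mpirun $MCA_FLAGS -app $APPFILE"
echo "$MPIRUN_CMD"
```

To launch, run the printed line as-is.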

  George.


On Sun, Apr 26, 2015 at 8:04 AM, Manumachu Reddy <manumachu.re...@gmail.com>
wrote:

>
> Hi George,
>
> I am afraid the suggestion to use btl_tcp_if_exclude did not help. I
> executed the following command:
>
> *shell$ mpirun --mca btl_tcp_if_exclude mic0,mic1 -app appfile*
> <the same output and hang>
>
> Please let me know if there are options to mpirun (apart from -v) to get
> verbose output to understand what is happening.
>
>
> On Fri, Apr 24, 2015 at 5:59 PM, George Bosilca <bosi...@icl.utk.edu>
> wrote:
>
>> Manumachu,
>>
>> Both nodes have the same IP for their Phi (mic0 and mic1). This is OK as
>> long as they don't try to connect to each other using these addresses. A
>> simple fix is to prevent OMPI from using the supposedly local mic0 and mic1
>> IP. Add --mca btl_tcp_if_exclude mic0,mic1 to your mpirun command and
>> things should start working better.
>>
>> George.
>>
>>
>>
>> On Apr 24, 2015, at 03:32, Manumachu Reddy <manumachu.re...@gmail.com>
>> wrote:
>>
>>
>> Dear OpenMPI Users,
>>
>> I request your help to resolve a hang in my OpenMPI application.
>>
>> My OpenMPI application hangs in MPI_Comm_split() operation. The code for
>> this simple application is at the end of this email. Broadcast works fine.
>>
>> My experimental setup comprises two RHEL6.4 Linux nodes. Each node has
>> 2 mic cards. Please note that although there are mic cards, I do not use
>> mic cards in my OpenMPI application.
>>
>> I have tested with two OpenMPI versions (1.6.5, 1.8.4). I see the hang in
>> both versions. OpenMPI is installed using the following command:
>>
>> ./configure \
>>   --prefix=/home/manumachu/OpenMPI/openmpi-1.8.4/OPENMPI_INSTALL_ICC \
>>   CC="icc -fPIC" CXX="icpc -fPIC"
>>
>> I have made sure I have turned off the firewall using the following
>> commands:
>>
>> sudo service iptables save
>> sudo service iptables stop
>> sudo chkconfig iptables off
>>
>> I made sure the mic cards are online and healthy. I am able to login to
>> the mic cards.
>>
>> I use an appfile to launch 2 processes on each node.
>>
>> I have also attached the "ifconfig" list for each node. Could this
>> problem be related to multiple network interfaces (from the application
>> output also shown at the end of the email)?
>>
>> Please let me know if you need further information and I look forward to
>> your suggestions.
>>
>> Best Regards
>> Manumachu
>>
>> *Application*
>>
>> #include <stdio.h>
>> #include <mpi.h>
>>
>> int main(int argc, char** argv)
>> {
>>     int me, hostnamelen;
>>     char hostname[MPI_MAX_PROCESSOR_NAME];
>>
>>     MPI_Init(&argc, &argv);
>>
>>     MPI_Get_processor_name(hostname, &hostnamelen);
>>
>>     MPI_Comm_rank(MPI_COMM_WORLD, &me);
>>     printf("Hostname %s: Me is %d.\n", hostname, me);
>>
>>     int a = 0;  /* initialized so the root broadcasts a defined value */
>>     MPI_Bcast(&a, 1, MPI_INT, 0, MPI_COMM_WORLD);
>>
>>     printf("Hostname %s: Me %d broadcasted.\n", hostname, me);
>>
>>     MPI_Comm intraNodeComm;
>>     int rc = MPI_Comm_split(
>>                 MPI_COMM_WORLD,
>>                 me,
>>                 me,
>>                 &intraNodeComm
>>     );
>>
>>     if (rc != MPI_SUCCESS)
>>     {
>>        printf("MAIN: Problems MPI_Comm_split...Exiting...\n");
>>        return -1;
>>     }
>>
>>     printf("Hostname %s: Me %d after comm split.\n", hostname, me);
>>     MPI_Comm_free(&intraNodeComm);
>>     MPI_Finalize();
>>
>>     return 0;
>> }
>>
>> *Application output*
>>
>> Hostname server5: Me is 0.
>> Hostname server5: Me is 1.
>> Hostname server5: Me 1 broadcasted.
>> Hostname server5: Me 0 broadcasted.
>> [server5][[50702,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 172.31.1.254 failed: Connection refused (111)
>> [server5][[50702,1],1][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 172.31.1.254 failed: Connection refused (111)
>> Hostname server2: Me is 2.
>> Hostname server2: Me 2 broadcasted.
>> Hostname server2: Me is 3.
>> Hostname server2: Me 3 broadcasted.
>>
>> *server2 ifconfig*
>>
>> eth0      Link encap:Ethernet...
>>           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>>           ...
>> eth1      Link encap:Ethernet...
>>           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>>           ...
>> eth2      Link encap:Ethernet...
>>           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>>           ...
>> eth3      Link encap:Ethernet...
>>           inet addr:172.17.27.17  Bcast:172.17.27.255  Mask:255.255.255.0
>>           inet6 addr: fe80::921b:eff:fe42:a5ba/64 Scope:Link
>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>           ...
>> lo        Link encap:Local Loopback
>>           inet addr:127.0.0.1  Mask:255.0.0.0
>>           UP LOOPBACK RUNNING  MTU:65536  Metric:1
>>           ...
>>
>> mic0      Link encap:Ethernet
>>           inet addr:172.31.1.254  Bcast:172.31.1.255  Mask:255.255.255.0
>>           ...
>>
>> mic1      Link encap:Ethernet...
>>           inet addr:172.31.2.254  Bcast:172.31.2.255  Mask:255.255.255.0
>>           ...
>>
>> *server5 ifconfig*
>> eth0      Link encap:Ethernet...
>>           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>>           ...
>>
>> eth1      Link encap:Ethernet...
>>           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>>           ...
>>
>> eth2      Link encap:Ethernet...
>>           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>>           ...
>>
>> eth3      Link encap:Ethernet...
>>           inet addr:172.17.27.14  Bcast:172.17.27.255  Mask:255.255.255.0
>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>           ...
>>
>> lo        Link encap:Local Loopback
>>           inet addr:127.0.0.1  Mask:255.0.0.0
>>           ...
>>
>> mic0      Link encap:Ethernet...
>>           inet addr:172.31.1.254  Bcast:172.31.1.255  Mask:255.255.255.0
>>           UP BROADCAST RUNNING  MTU:64512  Metric:1
>>           ...
>>
>> mic1      Link encap:Ethernet...
>>           inet addr:172.31.2.254  Bcast:172.31.2.255  Mask:255.255.255.0
>>           ...
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2015/04/26780.php
>>
>
>
>
> --
> Best Regards
> Ravi
>
