Ah, I see the problem. See this FAQ entry: http://www.open-mpi.org/faq/?category=tcp#tcp-selection
You want to exclude the virbr0 interfaces on your nodes; they're local-only interfaces (that's where the 192.168.122.x addresses are coming from) that, IIRC, have something to do with virtualization. Perhaps something like this:

    mpirun --mca btl_tcp_if_exclude virbr0 ...

Additionally, if you're bonding your IP interfaces just for MPI, you don't need to do that for Open MPI. Specifically: Open MPI will automatically use your eth0 and eth1 together without needing kernel-level bonding (OMPI ignores them right now because they're not "up", presumably because bond0 is using them). If you're running other software that can't do userspace-level bonding the way Open MPI can, you might still need kernel-level bonding.

On Sep 25, 2012, at 12:51 PM, Richard wrote:

> I have set up a small cluster with 3 nodes, named A, B and C respectively.
> I tested the ring_c.c program in the examples. For debugging purposes,
> I have added some print statements as follows in the original ring_c.c:
>
>   60    printf("rank %d, message %d,start===========\n", rank, message);
>   61    MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
>   62    printf("rank %d, message %d,end-------------\n", rank, message);
>
> I launched my MPI program as follows:
>
>   $ mpirun -np 3 --hostfile myhostfile ./ring
>
> The content of myhostfile is:
>
>   hostA slots=1
>   hostB slots=1
>   hostC slots=1
>
> I got the following output:
>
>   Process 0 sending 10 to 1, tag 201 (3 processes in ring)
>   Process 0 sent to 1
>   rank 1, message 10,start===========
>   rank 1, message 10,end-------------
>   rank 2, message 10,start===========
>   Process 0 decremented value: 9
>   rank 0, message 9,start===========
>   rank 0, message 9,end-------------
>   rank 2, message 10,end-------------
>   rank 1, message 9,start===========
>
> I assumed there is a communication problem between B and C, so I launched the
> program with 2 processes on B and C.
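Something like this sequence may help pin it down (a sketch, assuming a Linux node with the ip tool; note that setting btl_tcp_if_exclude replaces the default exclude list, which contains "lo", so you should keep lo listed as well):

```shell
# On each node, find which interface owns the 192.168.122.x address
# (on hosts running libvirt it is usually the virbr0 NAT bridge):
/sbin/ip addr show | grep -B 2 '192\.168\.122\.'

# Then exclude it at launch.  Keep "lo" in the list, because this
# parameter replaces the default exclude list rather than adding to it:
mpirun --mca btl_tcp_if_exclude lo,virbr0 -np 3 --hostfile myhostfile ./ring
```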
> The output is as follows:
>
>   Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>   Process 0 sent to 1
>   rank 1, message 10,start===========
>   rank 1, message 10,end-------------
>   Process 0 decremented value: 9
>   rank 0, message 9,start===========
>
> Again, in the second round of the pass, B failed to send the message to C.
> I checked the firewall configuration using "chkconfig --list iptables" on all
> the nodes; none of them are set to "on".
>
> Attached is all the information needed; my Open MPI version is 1.6.1.
>
> Thanks for your help.
> Richard
>
> At 2012-09-25 18:27:15, Richard <codemon...@163.com> wrote:
> I used "chkconfig --list iptables"; none of the computers is set to "on".
>
> At 2012-09-25 17:54:53, "Jeff Squyres" <jsquy...@cisco.com> wrote:
> > Have you disabled firewalls on your nodes (e.g., iptables)?
> >
> > On Sep 25, 2012, at 11:08 AM, Richard wrote:
> >
> >> Sometimes the following message jumps out when I run the ring program,
> >> but not always. I do not know this IP address, 192.168.122.1; it is not
> >> in my list of hosts.
> >>
> >> [[53402,1],6][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> >> connect() to 192.168.122.1 failed: Connection refused (111)
> >>
> >> At 2012-09-25 16:53:50, Richard <codemon...@163.com> wrote:
> >> If I try the ring program, the first round of the pass is fine, but the
> >> second round is blocked at some node. Here is the message printed out:
> >>
> >>   Process 0 sending 10 to 1, tag 201 (3 processes in ring)
> >>   Process 0 sent to 1
> >>   rank 1, message 10,start===========
> >>   rank 1, message 10,end-------------
> >>   rank 2, message 10,start===========
> >>   Process 0 decremented value: 9
> >>   rank 0, message 9,start===========
> >>   rank 0, message 9,end-------------
> >>   rank 2, message 10,end-------------
> >>   rank 1, message 9,start===========
> >>
> >> I have added some printf statements in ring_c.c as follows:
> >>
> >>   60    printf("rank %d, message %d,start===========\n", rank, message);
> >>   61    MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
> >>   62    printf("rank %d, message %d,end-------------\n", rank, message);
> >>
> >> At 2012-09-25 16:30:01, Richard <codemon...@163.com> wrote:
> >> Hi Jody,
> >> Thanks for your suggestion, and you are right: if I use the ring example,
> >> the same thing happens.
> >> I have put in a printf statement; it seems that all three processes have
> >> reached the line calling PMPI_Allreduce. Any further suggestions?
> >>
> >> Thanks.
> >> Richard
> >>
> >> Date: Tue, 25 Sep 2012 09:43:09 +0200
> >> From: jody <jody....@gmail.com>
> >> Subject: Re: [OMPI users] mpi job is blocked
> >> To: Open MPI Users <us...@open-mpi.org>
> >>
> >> Hi Richard
> >>
> >> When a collective call hangs, this usually means that one (or more) of
> >> the processes did not reach this command.
> >> Are you sure that all processes reach the allreduce statement?
> >>
> >> If something like this happens to me, I insert print statements just
> >> before the MPI call so I can see which processes made it to this point
> >> and which ones did not.
> >> Hope this helps a bit,
> >> Jody
> >>
> >> On Tue, Sep 25, 2012 at 8:20 AM, Richard <codemon...@163.com> wrote:
> >> > I have 3 computers with the same Linux system. I have set up the MPI
> >> > cluster based on ssh connections.
> >> > I have tested a very simple MPI program; it works on the cluster.
> >> >
> >> > To make my story clear, I name the three computers A, B and C.
> >> >
> >> > 1) If I run the job with 2 processes on A and B, it works.
> >> > 2) If I run the job with 3 processes on A, B and C, it is blocked.
> >> > 3) If I run the job with 2 processes on A and C, it works.
> >> > 4) If I run the job with all 3 processes on A, it works.
> >> >
> >> > Using gdb I found the line at which it is blocked; it is here:
> >> >
> >> > #7  0x00002ad8a283043e in PMPI_Allreduce (sendbuf=0x7fff09c7c578,
> >> >     recvbuf=0x7fff09c7c570, count=1, datatype=0x627180, op=0x627780,
> >> >     comm=0x627380) at pallreduce.c:105
> >> > 105   err = comm->c_coll.coll_allreduce(sendbuf, recvbuf, count,
> >> >
> >> > It seems that there is a communication problem between some of the
> >> > computers, but the above series of tests cannot tell me what exactly
> >> > it is. Can anyone help me? Thanks.
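Jody's print-before-the-collective approach boils down to a small standalone test; a minimal sketch (not from the thread, illustrative only) that could be compiled with mpicc and launched across the same hosts:

```c
/* debug_allreduce.c - minimal sketch of the "print before and after the
 * collective" debugging pattern.
 *   compile: mpicc debug_allreduce.c -o debug_allreduce
 *   run:     mpirun -np 3 --hostfile myhostfile ./debug_allreduce
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, local, global;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    local = rank + 1;

    /* A rank that prints "before" but never "after" is stuck in the
     * collective -- usually because another rank never reached it, or
     * because a TCP connection between two of the nodes cannot be made
     * (e.g., traffic routed out the wrong interface, as with virbr0). */
    fprintf(stderr, "rank %d of %d: before MPI_Allreduce\n", rank, size);
    fflush(stderr);
    MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    fprintf(stderr, "rank %d of %d: after MPI_Allreduce, sum=%d\n",
            rank, size, global);

    MPI_Finalize();
    return 0;
}
```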
> >> > Richard
> >> >
> >> > _______________________________________________
> >> > users mailing list
> >> > us...@open-mpi.org
> >> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> <trouble.tgz>

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/