Ah, I see the problem. See this FAQ entry: http://www.open-mpi.org/faq/?category=tcp#tcp-selection
You want to exclude the virbr0 interfaces on your nodes; they're local-only interfaces (that's where the 192.168.122.x addresses are coming from) that, IIRC, have something to do with virtualization. Perhaps something like this:

    mpirun --mca btl_tcp_if_exclude virbr0 ...

Additionally, if you're bonding your IP interfaces just for MPI, you don't need to do that for Open MPI. Specifically: Open MPI will automatically use your eth0 and eth1 together without needing kernel-level bonding (OMPI ignores them right now because they're not "up", presumably because bond0 is using them). If you're running other software that can't do userspace-level bonding the way Open MPI can, you might still need kernel-level bonding.

On Sep 25, 2012, at 12:51 PM, Richard wrote:

> I have set up a small cluster with 3 nodes, named A, B and C respectively.
> I tested the ring_c.c program in the examples. For debugging purposes,
> I have added some print statements as follows in the original ring_c.c:
>
>   60    printf("rank %d, message %d,start===========\n", rank, message);
>   61    MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
>   62    printf("rank %d, message %d,end-------------\n", rank, message);
>
> I launched my MPI program as follows:
>
>   $ mpirun -np 3 --hostfile myhostfile ./ring
>
> The content of myhostfile is:
>
>   hostA slots=1
>   hostB slots=1
>   hostC slots=1
>
> I got the following output:
>
>   Process 0 sending 10 to 1, tag 201 (3 processes in ring)
>   Process 0 sent to 1
>   rank 1, message 10,start===========
>   rank 1, message 10,end-------------
>   rank 2, message 10,start===========
>   Process 0 decremented value: 9
>   rank 0, message 9,start===========
>   rank 0, message 9,end-------------
>   rank 2, message 10,end-------------
>   rank 1, message 9,start===========
>
> I assumed there is a communication problem between B and C, so I launched the
> program with 2 processes on B and C.
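Something like this sequence may help pin it down (a sketch, assuming a Linux node with the ip tool; note that setting btl_tcp_if_exclude replaces the default exclude list, which contains "lo", so you should keep lo listed as well):

```shell
# On each node, find which interface owns the 192.168.122.x address
# (on hosts running libvirt it is usually the virbr0 NAT bridge):
/sbin/ip addr show | grep -B 2 '192\.168\.122\.'

# Then exclude it at launch.  Keep "lo" in the list, because this
# parameter replaces the default exclude list rather than adding to it:
mpirun --mca btl_tcp_if_exclude lo,virbr0 -np 3 --hostfile myhostfile ./ring
```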
> The output is as follows:
>
>   Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>   Process 0 sent to 1
>   rank 1, message 10,start===========
>   rank 1, message 10,end-------------
>   Process 0 decremented value: 9
>   rank 0, message 9,start===========
>
> Again, in the second round of the pass, B failed to send the message to C.
> I checked the firewall configuration using "chkconfig --list iptables" on all
> the nodes; none of them are set to "on".
>
> Attached is all the information needed; my Open MPI version is 1.6.1.
>
> Thanks for your help.
> Richard
>
> At 2012-09-25 18:27:15, Richard <codemon...@163.com> wrote:
> I used "chkconfig --list iptables"; none of the computers is set to "on".
>
> At 2012-09-25 17:54:53, "Jeff Squyres" <jsquy...@cisco.com> wrote:
> > Have you disabled firewalls on your nodes (e.g., iptables)?
> >
> > On Sep 25, 2012, at 11:08 AM, Richard wrote:
> >
> >> Sometimes the following message jumps out when I run the ring program,
> >> but not always. I do not know this IP address, 192.168.122.1; it is not
> >> in my list of hosts.
> >>
> >> [[53402,1],6][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> >> connect() to 192.168.122.1 failed: Connection refused (111)
> >>
> >> At 2012-09-25 16:53:50, Richard <codemon...@163.com> wrote:
> >> If I try the ring program, the first round of the pass is fine, but the
> >> second round is blocked at some node. Here is the message printed out:
> >>
> >>   Process 0 sending 10 to 1, tag 201 (3 processes in ring)
> >>   Process 0 sent to 1
> >>   rank 1, message 10,start===========
> >>   rank 1, message 10,end-------------
> >>   rank 2, message 10,start===========
> >>   Process 0 decremented value: 9
> >>   rank 0, message 9,start===========
> >>   rank 0, message 9,end-------------
> >>   rank 2, message 10,end-------------
> >>   rank 1, message 9,start===========
> >>
> >> I have added some printf statements in ring_c.c as follows:
> >>
> >>   60    printf("rank %d, message %d,start===========\n", rank, message);
> >>   61    MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
> >>   62    printf("rank %d, message %d,end-------------\n", rank, message);
> >>
> >> At 2012-09-25 16:30:01, Richard <codemon...@163.com> wrote:
> >> Hi Jody,
> >> Thanks for your suggestion, and you are right: if I use the ring example,
> >> the same thing happens.
> >> I have put in a printf statement; it seems that all three processes have
> >> reached the line calling PMPI_Allreduce. Any further suggestions?
> >>
> >> Thanks.
> >> Richard
> >>
> >> Date: Tue, 25 Sep 2012 09:43:09 +0200
> >> From: jody <jody....@gmail.com>
> >> Subject: Re: [OMPI users] mpi job is blocked
> >> To: Open MPI Users <us...@open-mpi.org>
> >>
> >> Hi Richard
> >>
> >> When a collective call hangs, this usually means that one (or more) of
> >> the processes did not reach this command.
> >> Are you sure that all processes reach the allreduce statement?
> >>
> >> If something like this happens to me, I insert print statements just
> >> before the MPI call so I can see which processes made it to this point
> >> and which ones did not.
> >> Hope this helps a bit,
> >> Jody
> >>
> >> On Tue, Sep 25, 2012 at 8:20 AM, Richard <codemon...@163.com> wrote:
> >> > I have 3 computers with the same Linux system. I have set up the MPI
> >> > cluster based on ssh connections.
> >> > I have tested a very simple MPI program; it works on the cluster.
> >> >
> >> > To make my story clear, I name the three computers A, B and C.
> >> >
> >> > 1) If I run the job with 2 processes on A and B, it works.
> >> > 2) If I run the job with 3 processes on A, B and C, it is blocked.
> >> > 3) If I run the job with 2 processes on A and C, it works.
> >> > 4) If I run the job with all 3 processes on A, it works.
> >> >
> >> > Using gdb I found the line at which it is blocked; it is here:
> >> >
> >> > #7  0x00002ad8a283043e in PMPI_Allreduce (sendbuf=0x7fff09c7c578,
> >> >     recvbuf=0x7fff09c7c570, count=1, datatype=0x627180, op=0x627780,
> >> >     comm=0x627380) at pallreduce.c:105
> >> > 105   err = comm->c_coll.coll_allreduce(sendbuf, recvbuf, count,
> >> >
> >> > It seems that there is a communication problem between some of the
> >> > computers, but the above series of tests cannot tell me what exactly
> >> > it is. Can anyone help me? Thanks.
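Jody's print-before-the-collective approach boils down to a small standalone test; a minimal sketch (not from the thread, illustrative only) that could be compiled with mpicc and launched across the same hosts:

```c
/* debug_allreduce.c - minimal sketch of the "print before and after the
 * collective" debugging pattern.
 *   compile: mpicc debug_allreduce.c -o debug_allreduce
 *   run:     mpirun -np 3 --hostfile myhostfile ./debug_allreduce
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, local, global;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    local = rank + 1;

    /* A rank that prints "before" but never "after" is stuck in the
     * collective -- usually because another rank never reached it, or
     * because a TCP connection between two of the nodes cannot be made
     * (e.g., traffic routed out the wrong interface, as with virbr0). */
    fprintf(stderr, "rank %d of %d: before MPI_Allreduce\n", rank, size);
    fflush(stderr);
    MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    fprintf(stderr, "rank %d of %d: after MPI_Allreduce, sum=%d\n",
            rank, size, global);

    MPI_Finalize();
    return 0;
}
```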
> >> > Richard
> >> >
> >> > _______________________________________________
> >> > users mailing list
> >> > us...@open-mpi.org
> >> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> <trouble.tgz>

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/