+1

Additionally, if you're trying to debug your machines/network/setup, you might want to use something simpler, like the ring programs in the examples/ directory.
On Sep 25, 2012, at 9:43 AM, jody wrote:

> Hi Richard
>
> When a collective call hangs, this usually means that one (or more)
> processes did not reach this command.
> Are you sure that all processes reach the allreduce statement?
>
> If something like this happens to me, I insert print statements just
> before the MPI call so I can see which processes made
> it to this point and which ones did not.
>
> Hope this helps a bit
> Jody
>
> On Tue, Sep 25, 2012 at 8:20 AM, Richard <codemon...@163.com> wrote:
>> I have 3 computers with the same Linux system. I have set up the MPI cluster
>> based on ssh connections.
>> I have tested a very simple MPI program, and it works on the cluster.
>>
>> To make my story clear, I name the three computers A, B and C.
>>
>> 1) If I run the job with 2 processes on A and B, it works.
>> 2) If I run the job with 3 processes on A, B and C, it is blocked.
>> 3) If I run the job with 2 processes on A and C, it works.
>> 4) If I run the job with all 3 processes on A, it works.
>>
>> Using gdb I found the line at which it is blocked. It is here:
>>
>> #7 0x00002ad8a283043e in PMPI_Allreduce (sendbuf=0x7fff09c7c578,
>> recvbuf=0x7fff09c7c570, count=1, datatype=0x627180, op=0x627780,
>> comm=0x627380)
>> at pallreduce.c:105
>> 105 err = comm->c_coll.coll_allreduce(sendbuf, recvbuf, count,
>>
>> It seems that there is a communication problem between some of the computers. But
>> the above series of tests cannot tell me what
>> exactly it is. Can anyone help me? Thanks.
>>
>> Richard
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/