I have three computers running the same Linux system. I have set up an MPI
cluster over SSH connections.
I have tested a very simple MPI program, and it works on the cluster.



To make my story clear, I will call the three computers A, B and C.


1) If I run the job with 2 processes on A and B, it works.
2) If I run the job with 3 processes on A, B and C, it blocks (hangs).
3) If I run the job with 2 processes on A and C, it works.
4) If I run the job with all 3 processes on A, it works.
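
Note that the four cases above never run B and C together without A, so
that pair is still untested. A minimal sketch for listing one two-process
run per pair of hosts, assuming an Open MPI style mpirun that accepts the
-np and --host options; the binary name ./mpi_test is hypothetical and
A, B, C stand in for the real host names:

```python
from itertools import combinations

HOSTS = ["A", "B", "C"]   # placeholder host names used in this post
PROGRAM = "./mpi_test"    # hypothetical test binary, not from the post


def pair_commands(hosts, program):
    """Return one mpirun command line per unordered pair of hosts."""
    return [
        f"mpirun -np 2 --host {a},{b} {program}"
        for a, b in combinations(hosts, 2)
    ]


if __name__ == "__main__":
    # Print the commands so each pair can be run and checked by hand.
    for cmd in pair_commands(HOSTS, PROGRAM):
        print(cmd)
```

If the B,C run hangs while the A,B and A,C runs succeed, that would point
at the link between B and C rather than at the program itself.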


Using gdb, I found the line where it blocks:


#7  0x00002ad8a283043e in PMPI_Allreduce (sendbuf=0x7fff09c7c578, 
recvbuf=0x7fff09c7c570, count=1, datatype=0x627180, op=0x627780, comm=0x627380)
    at pallreduce.c:105
105         err = comm->c_coll.coll_allreduce(sendbuf, recvbuf, count,


It seems that there is a communication problem between some of the computers,
but the above series of tests cannot tell me exactly what it is.
Can anyone help me? Thanks.


Richard
