I have 3 computers running the same Linux system. I have set up an MPI cluster over SSH connections. I tested a very simple MPI program, and it works on the cluster.
To make my story clear, I will call the three computers A, B and C.

1) If I run the job with 2 processes on A and B, it works.
2) If I run the job with 3 processes on A, B and C, it blocks.
3) If I run the job with 2 processes on A and C, it works.
4) If I run the job with all 3 processes on A, it works.

Using gdb, I found the line at which it blocks:

#7  0x00002ad8a283043e in PMPI_Allreduce (sendbuf=0x7fff09c7c578, recvbuf=0x7fff09c7c570, count=1, datatype=0x627180, op=0x627780, comm=0x627380) at pallreduce.c:105
105         err = comm->c_coll.coll_allreduce(sendbuf, recvbuf, count,

It seems that there is a communication problem between some of the computers, but the series of tests above cannot tell me exactly what it is. Can anyone help me?

Thanks.
Richard
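
P.S. The four runs above cover only some of the possible host combinations. As a rough sketch, assuming Open MPI's mpirun with its --host option and a hypothetical executable name ./mpi_test (neither is confirmed in my description above), the full list of multi-host runs could be generated like this:

```python
from itertools import combinations

HOSTS = ["A", "B", "C"]   # placeholder host names used in this post
PROGRAM = "./mpi_test"    # hypothetical executable name (assumption)


def host_sets(hosts):
    """Return every subset of two or more hosts, smallest subsets first."""
    sets = []
    for size in range(2, len(hosts) + 1):
        sets.extend(combinations(hosts, size))
    return sets


def mpirun_command(hosts, program=PROGRAM):
    """Build an mpirun line that lists the given hosts, with -np set to the host count."""
    return f"mpirun -np {len(hosts)} --host {','.join(hosts)} {program}"


if __name__ == "__main__":
    # Print one command per host combination so each can be run by hand.
    for hs in host_sets(HOSTS):
        print(mpirun_command(hs))
```

The generated list includes a run on B and C only, which is the one pair not covered by tests 1) to 4).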