Thanks for your suggestion, Gus; we need a way of debugging what is going on. I am pretty sure the problem lies with our cluster configuration. I know MPI simply relies on the underlying network; however, we can ping and ssh to all nodes (and between any pair of them as well), so it is currently a mystery why MPI doesn't communicate across nodes on our cluster. Two further questions for the group:
1. I would love to run the test program connectivity.c, but cannot find it anywhere. Can anyone help please?

2. After having left the job hanging overnight, we got the message

[node5][[9454,1],1][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)

Does anyone know what this means?

Cheers and thanks,
Ole

PS - I don't see how separate buffers would help. Recall that the test program I use works fine on other installations, and indeed when run on the cores of a single node.

Message: 11
Date: Mon, 19 Sep 2011 10:37:02 -0400
From: Gus Correa <g...@ldeo.columbia.edu>
Subject: Re: [OMPI users] RE : MPI hangs on multiple nodes
To: Open MPI Users <us...@open-mpi.org>

Hi Ole

You could try the examples/connectivity.c program in the Open MPI source tree to test whether everything is alright. It also hints at how to solve the buffer re-use issue that Sebastien [rightfully] pointed out [i.e., declare separate buffers for MPI_Send and MPI_Recv].

Gus Correa
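
[Editor's note: since the original examples/connectivity.c is not reproduced in this thread, the following is a minimal sketch of the same kind of pairwise connectivity check, written from scratch. It is not the Open MPI example itself; the pairing scheme and message contents are assumptions. It also uses separate send and receive buffers, as Gus suggests.]

/*
 * Sketch of a pairwise connectivity test: every rank exchanges one
 * message with every other rank, using separate send and receive buffers.
 * Lower-numbered rank sends first, higher-numbered rank receives first,
 * so the two sides cannot both block in MPI_Send at the same time.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, peer;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (peer = 0; peer < size; peer++) {
        if (peer == rank)
            continue;

        int sendbuf = rank;   /* separate buffer for MPI_Send */
        int recvbuf = -1;     /* separate buffer for MPI_Recv */

        if (rank < peer) {
            MPI_Send(&sendbuf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD);
            MPI_Recv(&recvbuf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(&recvbuf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&sendbuf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD);
        }

        if (recvbuf != peer)
            printf("rank %d: unexpected value %d from rank %d\n",
                   rank, recvbuf, peer);
    }

    printf("rank %d: exchanged messages with all %d other ranks\n",
           rank, size - 1);

    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched with mpirun across the nodes in question, a sketch like this should either finish within seconds or hang (or report the same readv timeout) on the failing pair of hosts, which helps narrow down which link is at fault.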