Thanks for your suggestion, Gus; we need a way of debugging what is going on.
I am pretty sure the problem lies with our cluster configuration. I know MPI
simply relies on the underlying network. However, we can ping and ssh to all
nodes (and between any pair of them as well), so it is currently a mystery why
MPI doesn't communicate across nodes on our cluster.
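
For reference, a minimal sketch (hypothetical, not the test program discussed
in this thread) of a check that exercises MPI traffic between nodes directly,
independently of ping/ssh: each rank reports its host and then passes a token
around a ring, so a hang or timeout here points at the MPI transport rather
than at basic connectivity.

/* ring_check.c -- hypothetical minimal cross-node check. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len, token = -1;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);
    printf("rank %d of %d on %s\n", rank, size, host);

    if (size > 1) {
        int next = (rank + 1) % size;
        int prev = (rank + size - 1) % size;
        /* Send my rank to the next rank, receive from the previous one. */
        MPI_Sendrecv(&rank, 1, MPI_INT, next, 0,
                     &token, 1, MPI_INT, prev, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank %d got token %d from rank %d\n", rank, token, prev);
    }

    MPI_Finalize();
    return 0;
}
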
Two further questions for the group:

   1. I would love to run the test program connectivity.c, but cannot find
   it anywhere. Can anyone help please?
   2. After having left the job hanging overnight we got the message

[node5][[9454,1],1][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)

   Does anyone know what this means?


Cheers and thanks
Ole
PS - I don't see how separate buffers would help. Recall that the test
program I use works fine on other installations, and indeed when run on the
cores of a single node.




Date: Mon, 19 Sep 2011 10:37:02 -0400
From: Gus Correa <g...@ldeo.columbia.edu>
Subject: Re: [OMPI users] RE :  MPI hangs on multiple nodes
To: Open MPI Users <us...@open-mpi.org>

Hi Ole

You could try the examples/connectivity.c program in the
Open MPI source tree, to test if everything is alright.
It also hints at how to solve the buffer re-use issue
that Sebastien [rightfully] pointed out [i.e., declare separate
buffers for MPI_Send and MPI_Recv].
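
For illustration, a minimal sketch (hypothetical, not connectivity.c itself)
of what "declare separate buffers" means in practice: the MPI_Send and
MPI_Recv calls each get their own array, so neither call reads or writes
memory the other one is using.

/* separate_buffers.c -- hypothetical two-rank ping-pong with distinct
 * send and receive buffers. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double sendbuf[4] = {1.0, 2.0, 3.0, 4.0};  /* outgoing data */
    double recvbuf[4] = {0.0};                 /* separate incoming buffer */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(sendbuf, 4, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
        MPI_Recv(recvbuf, 4, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Recv(recvbuf, 4, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(sendbuf, 4, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD);
    }

    if (rank < 2)
        printf("rank %d: first received value %g\n", rank, recvbuf[0]);

    MPI_Finalize();
    return 0;
}

Run it with at least two processes spread across nodes, e.g. something like
"mpirun -np 2 --host node4,node5 ./separate_buffers" (the host names here are
just placeholders).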

Gus Correa
