On Mon, Jun 11, 2007 at 10:55:17PM +0100, Jonathan Underwood wrote:

> Hi,

Hi!

> I am seeing problems with a small linux cluster when running OpenMPI
> jobs. The error message I get is:

Which OMPI version?

> $ perl -e 'die$!=110'
> Connection timed out at -e line 1.

Looks pretty much like a routing issue. Can you sniff on eth1 on the
frontend node?
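
For reference, the perl one-liner quoted above only turns errno 110 into its message text; on Linux that value is ETIMEDOUT. The same lookup can be sketched in Python. The helper name describe_errno is made up for this example, and the numeric value is platform dependent, so 110 is only meaningful on Linux.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message text) for a numeric errno value."""
    # errno.errorcode maps numbers to names such as "ETIMEDOUT";
    # unknown numbers fall back to a placeholder name.
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # Portable: look the code up by its symbolic constant.
    print(describe_errno(errno.ETIMEDOUT))
    # Linux-specific: the raw number from the error message.
    print(describe_errno(110))
```

A timeout (rather than "Connection refused") usually means the packets or their replies never arrive, which is why I suspect routing or a firewall.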

> This error message occurs the first time one of the compute nodes,
> which are on a private network, attempts to send data to the frontend

> In actual fact, it seems that the error occurs the first time a
> process on the frontend tries to send data to another process on the
> frontend.

So which direction exactly fails: compute node -> frontend, or
frontend -> frontend? I doubt you have two processes on the frontend
node, and even if you do, they should communicate via shared memory
rather than TCP.

> Any advice would be very welcome

Use tcpdump and/or recompile with debugging enabled. In addition, set
WANT_PEER_DUMP to 1 in ompi/mca/btl/tcp/btl_tcp_endpoint.c (line 120)
before recompiling; this gives you more debug output.
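
Before reaching for tcpdump, a plain TCP connect from a compute node to the frontend can show whether the route works at all. The following is only a rough sketch: the function name can_connect, the host name "frontend" and port 22 are placeholders (22 assumes sshd is listening there). It does not exercise the ports Open MPI itself picks, so it checks basic reachability only.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Attempt a TCP connect; return (True, None) or (False, error text)."""
    try:
        # create_connection resolves the host and applies the timeout
        # to the connect() call itself.
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:
        # Covers timeouts, refused connections and resolution failures.
        return False, str(exc)


if __name__ == "__main__":
    # Placeholder target: replace with the frontend's address on the
    # private network and a port known to be listening.
    print(can_connect("frontend", 22))
```

Run it once against each of the frontend's addresses; if only the private-network address times out, look at the routing table and firewall rules for that interface.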

Depending on your OMPI version, you can also add

mpi_preconnect_all=1

to your ~/.openmpi/mca-params.conf, so that all connections are
established during MPI_Init() instead of on first use.
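
To double-check that the parameter really ended up in the file, something like the following could be used. It is a minimal sketch that assumes the file consists of "name = value" lines with "#" comments; the helper name read_mca_params is made up here and is not part of Open MPI.

```python
import os


def read_mca_params(path):
    """Parse 'name = value' lines into a dict, ignoring '#' comments."""
    params = {}
    with open(path) as fh:
        for line in fh:
            # Drop trailing comments and surrounding whitespace.
            line = line.split("#", 1)[0].strip()
            if "=" not in line:
                continue
            key, value = line.split("=", 1)
            params[key.strip()] = value.strip()
    return params


if __name__ == "__main__":
    conf = os.path.expanduser("~/.openmpi/mca-params.conf")
    if os.path.exists(conf):
        print(read_mca_params(conf).get("mpi_preconnect_all"))
    else:
        print("no per-user parameter file found")
```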

If nothing helps, exclude the frontend from computation.


-- 
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany

private: http://adi.thur.de
