On Mon, Jun 11, 2007 at 10:55:17PM +0100, Jonathan Underwood wrote:
> Hi,

Hi!

> I am seeing problems with a small linux cluster when running OpenMPI
> jobs. The error message I get is:

Which OMPI version?

> $ perl -e 'die$!=110'
> Connection timed out at -e line 1.

Looks pretty much like a routing issue. Can you sniff on eth1 on the
frontend node?

> This error message occurs the first time one of the compute nodes,
> which are on a private network, attempts to send data to the frontend

> In actual fact, it seems that the error occurs the first time a
> process on the frontend tries to send data to another process on the
> frontend.

What's the exact problem? compute-node -> frontend? I don't think you
have two processes on the frontend node, and even if you do, they should
use shared memory.

> Any advice would be very welcome

Use tcpdump and/or recompile with debug enabled. In addition, set
WANT_PEER_DUMP in ompi/mca/btl/tcp/btl_tcp_endpoint.c to 1 (line 120)
and recompile, which gives you more debug output.

Depending on your OMPI version, you can also add mpi_preconnect_all=1 to
your ~/.openmpi/mca-params.conf, which establishes all connections
during MPI_Init().

If nothing helps, exclude the frontend from computation.

-- 
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany
private: http://adi.thur.de
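
P.S. A minimal sketch of the mpi_preconnect_all suggestion above, assuming
your OMPI version knows that parameter and reads the per-user file
~/.openmpi/mca-params.conf. The variable name "conf" is only for
illustration; the "name = value" line is the usual format of that file.

```shell
# Per-user MCA parameter file (path as mentioned in the mail above).
conf="$HOME/.openmpi/mca-params.conf"

# Create the directory if it does not exist yet.
mkdir -p "$(dirname "$conf")"

# Append the parameter only if no such line is present already,
# so that running this twice does not duplicate the entry.
grep -q '^mpi_preconnect_all' "$conf" 2>/dev/null \
    || echo 'mpi_preconnect_all = 1' >> "$conf"
```

With this in place a connection problem should show up during MPI_Init()
rather than at the first send. While the job starts, something like
"tcpdump -n -i eth1 tcp" on the frontend (as root) should show whether the
connection attempts from the compute nodes arrive at all.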