[OMPI users] Problem starting jobs

Emyr James Thu, 1 Oct 2015 05:24:51 -0400 (EDT)

Hi,

I am using Open MPI with Platform LSF on our cluster, which has 10GbE connectivity. Sometimes things work fine, but we get a lot of occurrences of MPI jobs not getting off the ground, and the following appears in the log...


"ORTE has lost communication with its daemon located on node:

  hostname:  node123

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job."

This seems to happen more frequently as the number of workers in the job increases. I'm wondering if there's some timeout involved here which I could increase to make things more reliable.

I tried adding "--wait-for-server --server-wait-time 30" to the mpirun command line but it doesn't seem to be making any difference.


Anyone got any ideas on what might be going on?

Cheers,

Emyr


--

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
