Hi,

I am using Open MPI with Platform LSF on our cluster, which has 10GbE connectivity. Sometimes things work fine, but we get a lot of occurrences of MPI jobs not getting off the ground, and the following appears in the log:

"ORTE has lost communication with its daemon located on node:

  hostname:  node123

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job."

This seems to happen more frequently as the number of workers in the job increases. I'm wondering if there's some timeout involved here which I could increase to make things more reliable.

I tried adding "--wait-for-server --server-wait-time 30" to the mpirun command line but it doesn't seem to be making any difference.

Has anyone got any ideas on what might be going on?

Cheers,

Emyr


