Hi,
I am using Open MPI with Platform LSF on our cluster, which has 10GbE
connectivity.
Sometimes things work fine, but we get a lot of occurrences of MPI jobs
not getting off the ground, with the following appearing in the log:
"ORTE has lost communication with its daemon located on node:
hostname: node123
This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job."
This seems to happen more frequently as the number of workers in the job
increases. I'm wondering if there's some timeout involved here which I
could increase to make things more reliable.
I tried adding "--wait-for-server --server-wait-time 30" to the mpirun
command line but it doesn't seem to be making any difference.
Has anyone got any ideas on what might be going on?
Cheers,
Emyr