Re: [OMPI users] Problem starting jobs

2015-10-01 Thread Emyr James
On 01/10/2015 10:24, Emyr James wrote: "ORTE has lost communication with its daemon located on node: hostname: node123 This is usually due to either a failure of the TCP network connection to the node, or possibly an internal failure of the daemon itself. We cannot recover from this failu

[OMPI users] Problem starting jobs

2015-10-01 Thread Emyr James
Hi, I am using Open MPI with Platform LSF on our cluster, which has 10GbE connectivity. Sometimes things work fine, but we get a lot of occurrences of MPI jobs not getting off the ground, and the following appears in the log... "ORTE has lost communication with its daemon located on node: hostna