Yeah, I've started seeing this on clusters where the TCP stack is a
little congested. By default we try 60 times to send a message, but the
retries happen in rapid succession, so they don't give the network much
time to recover. Try setting -mca oob_tcp_peer_retries 1000 (or some
number much bigger than 60).
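
For reference, here is a rough sketch of two ways the retry count could be raised. The value 1000 and the parameter name come from the suggestion above; the process count (-np 16) and the binary name (./my_app) are placeholders I made up, so substitute your own. Open MPI also reads MCA parameters from environment variables prefixed with OMPI_MCA_, which may be handier inside a batch script.

```shell
# Retry count from the suggestion above; anything much bigger than 60 should do.
RETRIES=1000

# Option 1: export the parameter so every mpirun started from this shell
# picks it up. The OMPI_MCA_ prefix turns an environment variable into an
# MCA parameter.
export OMPI_MCA_oob_tcp_peer_retries="$RETRIES"

# Option 2: pass it explicitly on the command line.
# ./my_app and -np 16 are hypothetical placeholders, not from the original post.
if command -v mpirun >/dev/null 2>&1 && [ -x ./my_app ]; then
    mpirun -mca oob_tcp_peer_retries "$RETRIES" -np 16 ./my_app
fi
```

Either form should be enough on its own; there is no need to use both.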
Just got this in a user job.
Any idea why it complains like this?
The original error was the infamous "RETRY EXCEEDED ERROR", but instead
of killing the job it showed this and the job never died.
I have never seen this happen before.
Open MPI 1.3.2, built with Intel 10.1.
This binary is used a lot (+50% of th