Re: [OMPI users] oob-tcp problem, unreachable in orted_comm

2009-06-06 Thread Ralph Castain
Yeah, I've started seeing this on clusters where the TCP stack is a little congested. We default to trying 60 times to send a message, but the retries are done in rapid succession, so they don't really cover a lot of time. Try setting -mca oob_tcp_peer_retries 1000 (or some number much bigger than 60).
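For reference, a minimal sketch of one way to apply that setting. On the command line it would look something like "mpirun -mca oob_tcp_peer_retries 1000 -np 4 ./my_app", where the process count and the application name are placeholders and not taken from the original report. Open MPI also reads MCA parameters from environment variables prefixed with OMPI_MCA_, so the same value can be exported once before launching; the value 1000 is simply the one suggested above, not a tuned recommendation.

```shell
# Environment-variable form of "-mca oob_tcp_peer_retries 1000".
# Assumes the usual OMPI_MCA_<parameter> naming convention applies to this parameter.
export OMPI_MCA_oob_tcp_peer_retries=1000

# Then launch as usual, e.g. (hypothetical process count and binary name):
#   mpirun -np 4 ./my_app
```

Whether the parameter was picked up can be checked with ompi_info, which lists the MCA parameters of the installed build.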

[OMPI users] oob-tcp problem, unreachable in orted_comm

2009-06-06 Thread Åke Sandgren
Just got this in a user job. Any idea why it complains like this? The original error was the infamous "RETRY EXCEEDED ERROR", but instead of killing the job it showed this and never died. I have never seen this happen before. Open MPI 1.3.2, built with Intel 10.1. This binary is used a lot (+50% of th