Yeah, I've started seeing this on clusters where the TCP stack is a little congested. By default we try 60 times to send a message, but the attempts are made in rapid succession, so together they don't give a congested network much time to recover.

Try setting -mca oob_tcp_peer_retries 1000 (or some number much bigger than 60). This has always fixed the problem so far.

If it works, you might want to put it in the system default mca param file.
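
For reference, a rough sketch of the usual ways to set this parameter. The value 1000 is just the one suggested above; the process count, the binary name ./my_app, and the install prefix are placeholders, so adjust them for your site.

```shell
# One-off, on the command line (process count and ./my_app are placeholders):
#   mpirun -mca oob_tcp_peer_retries 1000 -np 16 ./my_app

# Per shell session: Open MPI also picks up MCA parameters from
# environment variables named OMPI_MCA_<parameter name>.
export OMPI_MCA_oob_tcp_peer_retries=1000

# System-wide default: add the following line to
# <prefix>/etc/openmpi-mca-params.conf, where <prefix> is wherever
# Open MPI was installed on your system:
#   oob_tcp_peer_retries = 1000
```

The command-line form is the easiest way to check whether the larger retry count actually helps before changing the default for every user.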

On Jun 6, 2009, at 10:18 AM, Åke Sandgren wrote:

Just got this in a user job.
Any idea why it complains like this?
The original error was the infamous "RETRY EXCEEDED ERROR" but instead
of killing the job it showed this and never died.
I have never seen this happen before.

openmpi 1.3.2, built with intel 10.1
This binary is used A LOT (+50% of the system walltime) and has never
shown this specific problem and rarely the "Retry exceeded error"
either.

[p-bc2503.hpc2n.umu.se:11892] [[34820,0],0]-[[34820,0],1] oob-tcp: Communication retries exceeded.  Can not communicate with peer
[p-bc2503.hpc2n.umu.se:11892] [[34820,0],0] ORTE_ERROR_LOG: Unreachable in file orted/orted_comm.c at line 130
[p-bc2503.hpc2n.umu.se:11892] [[34820,0],0] ORTE_ERROR_LOG: Unreachable in file orted/orted_comm.c at line 130
[p-bc2503.hpc2n.umu.se:11892] [[34820,0],0]-[[34820,0],1] oob-tcp: Communication retries exceeded.  Can not communicate with peer


--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

