Re: [OMPI users] oob-tcp problem, unreachable in orted_comm

Ralph Castain Sat, 6 Jun 2009 12:25:26 -0400

Yeah, I've started seeing this on clusters where the TCP stack is a little congested. We default to trying 60 times to send a message, but the attempts are made in rapid succession, so they don't really give the network much time to recover.

Try setting -mca oob_tcp_peer_retries 1000 (or some number much bigger than 60). This has always fixed the problem so far.
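
For illustration, here is a rough sketch of two ways that setting could be supplied for a single job. The application name ./my_app and the rank count -np 16 are placeholders, not taken from the original report. The environment-variable form assumes Open MPI's usual OMPI_MCA_<param> naming convention.

```shell
# Option 1: set the MCA parameter through the environment. Open MPI picks up
# variables named OMPI_MCA_<param> when mpirun starts.
export OMPI_MCA_oob_tcp_peer_retries=1000

# Hypothetical launch line (./my_app and -np 16 are placeholders):
#   mpirun -np 16 ./my_app

# Option 2: pass the parameter directly on the command line instead:
#   mpirun -mca oob_tcp_peer_retries 1000 -np 16 ./my_app
```

Either form affects only the jobs launched that way, which makes it a low-risk way to check whether a larger retry count helps.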

If it works, you might want to put it in the system default mca param file.
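
A minimal sketch of making the setting persistent, assuming the standard param file locations: <prefix>/etc/openmpi-mca-params.conf for the system-wide file, or $HOME/.openmpi/mca-params.conf for a single user. The PARAM_FILE variable is a hypothetical helper for this example and defaults to a scratch file, so nothing in a real install is touched unless you point it there.

```shell
# Assumption: set PARAM_FILE to <prefix>/etc/openmpi-mca-params.conf of your
# install. It defaults to a temporary file so the snippet is safe to run as-is.
PARAM_FILE="${PARAM_FILE:-$(mktemp)}"

# Append the parameter in the "name = value" form used by MCA param files.
printf '%s\n' \
  '# Allow more OOB TCP send retries on a congested network (default is 60)' \
  'oob_tcp_peer_retries = 1000' >> "$PARAM_FILE"

# Show the resulting entry.
grep 'oob_tcp_peer_retries' "$PARAM_FILE"
```

Values given on the mpirun command line still take precedence over the file, so individual jobs can override the default.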


On Jun 6, 2009, at 10:18 AM, Åke Sandgren wrote:

Just got this in a user job.
Any idea why it complains like this?
The original error was the infamous "RETRY EXCEEDED ERROR", but instead
of killing the job it showed this and the job never died.
I have never seen this happen before.

openmpi 1.3.2, built with intel 10.1
This binary is used a lot (over 50% of the system walltime) and has never
shown this specific problem, and only rarely the "Retry exceeded error"
either.

[p-bc2503.hpc2n.umu.se:11892] [[34820,0],0]-[[34820,0],1] oob-tcp: Communication retries exceeded.  Can not communicate with peer
[p-bc2503.hpc2n.umu.se:11892] [[34820,0],0] ORTE_ERROR_LOG:Unreachable in file orted/orted_comm.c at line 130
[p-bc2503.hpc2n.umu.se:11892] [[34820,0],0] ORTE_ERROR_LOG:Unreachable in file orted/orted_comm.c at line 130
[p-bc2503.hpc2n.umu.se:11892] [[34820,0],0]-[[34820,0],1] oob-tcp: Communication retries exceeded.  Can not communicate with peer


--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
