Just got this in a user job.
Any idea why it complains like this?
The original error was the infamous "RETRY EXCEEDED ERROR", but instead
of killing the job it printed the messages below and the job never died.
I have never seen this happen before.

Open MPI 1.3.2, built with Intel 10.1.
This binary is used a lot (over 50% of the system walltime) and has never
shown this specific problem, and it rarely shows the "Retry exceeded error"
either.

[p-bc2503.hpc2n.umu.se:11892] [[34820,0],0]-[[34820,0],1] oob-tcp: Communication retries exceeded.  Can not communicate with peer
[p-bc2503.hpc2n.umu.se:11892] [[34820,0],0] ORTE_ERROR_LOG: Unreachable in file orted/orted_comm.c at line 130
[p-bc2503.hpc2n.umu.se:11892] [[34820,0],0] ORTE_ERROR_LOG: Unreachable in file orted/orted_comm.c at line 130
[p-bc2503.hpc2n.umu.se:11892] [[34820,0],0]-[[34820,0],1] oob-tcp: Communication retries exceeded.  Can not communicate with peer


-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
