Just got this in a user job. Any idea why it complains like this? The original error was the infamous "RETRY EXCEEDED ERROR", but instead of killing the job it printed the messages below and the job never died. I have never seen this happen before.
Open MPI 1.3.2, built with Intel 10.1. This binary is used a lot (over 50% of the system walltime) and has never shown this specific problem, and only rarely the "Retry exceeded error" either.

[p-bc2503.hpc2n.umu.se:11892] [[34820,0],0]-[[34820,0],1] oob-tcp: Communication retries exceeded. Can not communicate with peer
[p-bc2503.hpc2n.umu.se:11892] [[34820,0],0] ORTE_ERROR_LOG: Unreachable in file orted/orted_comm.c at line 130
[p-bc2503.hpc2n.umu.se:11892] [[34820,0],0] ORTE_ERROR_LOG: Unreachable in file orted/orted_comm.c at line 130
[p-bc2503.hpc2n.umu.se:11892] [[34820,0],0]-[[34820,0],1] oob-tcp: Communication retries exceeded. Can not communicate with peer

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se