I'm trying to run a 64-way MPI benchmark on my system. I
*always* get the following error, and I'm wondering how to
debug the problem node. I cannot reproduce the problem
with a smaller number of nodes.

snip...
[p1d049:18547] [0,1,48]-[0,1,20] mca_oob_tcp_peer_complete_connect: connect() failed with errno=113
[p1d049:18547] [0,1,48]-[0,1,21] mca_oob_tcp_peer_complete_connect: connect() failed with errno=113
[p1d049:18547] [0,1,48]-[0,1,24] mca_oob_tcp_peer_complete_connect: connect() failed with errno=113
[p1d049:18547] [0,1,48]-[0,1,25] mca_oob_tcp_peer_complete_connect: connect() failed with errno=113
...

I have well over 128 lines of similar output. A quick eyeball of
the output suggests that about half of all nodes are reporting this
problem.
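
Rather than eyeballing the output, the failures could be tallied with a
short script. The following is only a minimal sketch: it assumes the log
lines have the format shown above, the function name tally_failures is
hypothetical, and the pattern tolerates the mail-wrapped "connect() fa /
iled" form as well as the joined form. For reference, on Linux errno 113
is EHOSTUNREACH ("No route to host"); that mapping is platform-specific.

```python
import re
from collections import Counter

# Matches one report such as:
#   [p1d049:18547] [0,1,48]-[0,1,20] mca_oob_tcp_peer_complete_connect: connect() failed with errno=113
# The \s* inside "fa\s*iled" tolerates the line wrap seen in the pasted output.
LINE_RE = re.compile(
    r"\[(?P<host>[^:\]\s]+):(?P<pid>\d+)\]\s*"
    r"\[(?P<src>[\d,]+)\]-\[(?P<dst>[\d,]+)\]\s*"
    r"mca_oob_tcp_peer_complete_connect:\s*"
    r"connect\(\)\s*fa\s*iled\s+with\s+errno=(?P<errno>\d+)"
)

def tally_failures(text):
    """Count failed connects per reporting host and per target triple.

    Hypothetical helper: 'text' is the captured stderr of the run.
    Returns (by_host, by_target), both collections.Counter objects.
    """
    by_host = Counter()
    by_target = Counter()
    for m in LINE_RE.finditer(text):
        by_host[m.group("host")] += 1
        by_target[m.group("dst")] += 1
    return by_host, by_target

if __name__ == "__main__":
    import sys
    hosts, targets = tally_failures(sys.stdin.read())
    print("failures by reporting host:")
    for host, n in hosts.most_common():
        print(f"  {host}: {n}")
    print("failures by target:")
    for dst, n in targets.most_common():
        print(f"  [{dst}]: {n}")
```

Feeding the saved output of the run to the script on stdin would show
whether the failures cluster on a few reporting hosts or on a few
targets, which may help narrow down which node to look at first.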

I have checked the error counters on my IB switch and I
have 0 new errors during the run.

TIA.


R.
