I'm trying to run a 64-way MPI benchmark on my system. I *always* get the following error, and I'm wondering how to debug the problem node. I cannot reproduce the problem with a smaller number of nodes.
snip...
[p1d049:18547] [0,1,48]-[0,1,20] mca_oob_tcp_peer_complete_connect: connect() failed with errno=113
[p1d049:18547] [0,1,48]-[0,1,21] mca_oob_tcp_peer_complete_connect: connect() failed with errno=113
[p1d049:18547] [0,1,48]-[0,1,24] mca_oob_tcp_peer_complete_connect: connect() failed with errno=113
[p1d049:18547] [0,1,48]-[0,1,25] mca_oob_tcp_peer_complete_connect: connect() failed with errno=113
...

It looks like I have well over 128 lines of similar output. A quick eyeball of the output seems to indicate about 1/2 of all nodes are reporting this problem. I have checked the error counters on my IB switch and I have 0 new errors during the run.

TIA.
R.
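P.S. Rather than eyeballing the output, something like the rough sketch below could tally which hosts report the failure and which peers they fail to reach. It assumes the third number in each bracketed triple (e.g. the 48 in [0,1,48]) identifies the process, which is my reading of the output and not something I have confirmed. The function name `tally` and the `SAMPLE` text are just illustrative. For what it's worth, on Linux errno 113 is EHOSTUNREACH ("No route to host"), assuming these nodes are Linux.

```python
import re
from collections import Counter

# Matches lines such as:
#   [p1d049:18547] [0,1,48]-[0,1,20] mca_oob_tcp_peer_complete_connect: connect() failed with errno=113
# "fa\s*iled" tolerates a stray line-wrap break in the middle of "failed".
PATTERN = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s+"
    r"\[\d+,\d+,(?P<src>\d+)\]-\[\d+,\d+,(?P<dst>\d+)\]\s+"
    r"mca_oob_tcp_peer_complete_connect:\s+connect\(\)\s+"
    r"fa\s*iled\s+with\s+errno=(?P<errno>\d+)"
)


def tally(text):
    """Count failure lines per reporting host and per unreachable peer.

    Assumption: the third field of each [a,b,c] triple identifies the process.
    """
    by_host = Counter()  # hostname that printed the error
    by_dst = Counter()   # peer id the host could not connect to
    for m in PATTERN.finditer(text):
        by_host[m.group("host")] += 1
        by_dst[int(m.group("dst"))] += 1
    return by_host, by_dst


# The four lines quoted in this post, used as sample input.
SAMPLE = """
[p1d049:18547] [0,1,48]-[0,1,20] mca_oob_tcp_peer_complete_connect: connect() failed with errno=113
[p1d049:18547] [0,1,48]-[0,1,21] mca_oob_tcp_peer_complete_connect: connect() failed with errno=113
[p1d049:18547] [0,1,48]-[0,1,24] mca_oob_tcp_peer_complete_connect: connect() failed with errno=113
[p1d049:18547] [0,1,48]-[0,1,25] mca_oob_tcp_peer_complete_connect: connect() failed with errno=113
"""

hosts, peers = tally(SAMPLE)
for host, count in hosts.most_common():
    print(f"{host}: {count} failed connects")
for peer, count in sorted(peers.items()):
    print(f"peer {peer}: unreachable {count} time(s)")
```

Feeding the full job output through `tally` should show whether the failures cluster on a few reporting hosts or on a few unreachable peers, which would narrow down where to look.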