Greetings Robert. Can you send all the information listed here:
http://www.open-mpi.org/community/help/

Of particular interest will be the version that you are using. We had some bugs with the TCP connection code that were recently fixed. Can you try the latest 1.1.1 beta tarball and see if it fixes your problem?

http://www.open-mpi.org/software/ompi/v1.1/

On 8/2/06 11:11 AM, "Robert Cummins" <rcumm...@lnxi.com> wrote:

> I'm trying to run a 64 way mpi benchmark on my system. I
> *always* get the following error and I'm wondering how do
> I debug the problem node? I can not reproduce the problem
> with a smaller number of nodes.
>
> snip...
> [p1d049:18547] [0,1,48]-[0,1,20] mca_oob_tcp_peer_complete_connect: connect() failed with errno=113
> [p1d049:18547] [0,1,48]-[0,1,21] mca_oob_tcp_peer_complete_connect: connect() failed with errno=113
> [p1d049:18547] [0,1,48]-[0,1,24] mca_oob_tcp_peer_complete_connect: connect() failed with errno=113
> [p1d049:18547] [0,1,48]-[0,1,25] mca_oob_tcp_peer_complete_connect: connect() failed with errno=113
> ...
>
> It looks like I have well over 128 lines of similar output. A quick eyeball of
> the output seems to indicate about 1/2 of all nodes are reporting this
> problem.
>
> I have checked the error counters on my IB switch and I
> have 0 new errors during the run.
>
> TIA.
>
> R.

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems
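
To narrow down which nodes are involved, it may help to tally the connect() failures in the quoted output by reporting host and by destination process. The following is only a rough sketch, not part of Open MPI: the names `LINE_RE` and `summarize` are made up for illustration, and the pattern assumes each message sits on one unwrapped line in exactly the format quoted above. For reference, on Linux errno 113 is EHOSTUNREACH ("No route to host"); this assumes the job ran on Linux.

```python
import re
from collections import Counter

# Assumed log format (taken from the quoted output, unwrapped onto one line):
# [p1d049:18547] [0,1,48]-[0,1,20] mca_oob_tcp_peer_complete_connect: connect() failed with errno=113
LINE_RE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s+"
    r"\[(?P<src>[\d,]+)\]-\[(?P<dst>[\d,]+)\]\s+"
    r"mca_oob_tcp_peer_complete_connect:\s+"
    r"connect\(\) failed with errno=(?P<errno>\d+)"
)


def summarize(log_text):
    """Count connect() failures per reporting host and per destination process name."""
    by_host = Counter()
    by_dst = Counter()
    for match in LINE_RE.finditer(log_text):
        by_host[match.group("host")] += 1
        by_dst[match.group("dst")] += 1
    return by_host, by_dst


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python tally_connect_errors.py < mpi_run.log
    hosts, dsts = summarize(sys.stdin.read())
    print("failures by reporting host:")
    for host, count in hosts.most_common():
        print(f"  {host}: {count}")
    print("failures by destination process:")
    for dst, count in dsts.most_common():
        print(f"  [{dst}]: {count}")
```

If a handful of hosts or destination processes account for most of the failures, those are reasonable places to start checking TCP reachability between nodes.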