On Thu, Sep 13, 2007 at 11:15:47AM -0500, Tim Campbell wrote:
> workstations. When mpirun tries to start the processes on certain
> nodes I get the following error output.
>
> [sr70][0,1,2][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=111
> [sr71][0,1,3][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=111
>
> Using perl -e 'die$!=111' I see that the error message is "Connection
> refused". I am able to connect to both nodes in question via ssh and/
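
The errno lookup quoted above can also be combined with a direct connection test between nodes. What follows is only a minimal sketch in Python, not anything Open MPI provides: the helper names describe_errno and probe_tcp are made up for illustration, and the port you probe is an assumption (it has to be one that a process on the target node is actually listening on).

```python
import errno
import os
import socket


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


def probe_tcp(host, port, timeout=3.0):
    """Attempt a plain TCP connect to host:port.

    Returns 0 if the connection succeeds, otherwise the errno carried by
    the exception (or -1 if there is none, e.g. on a timeout).
    Hypothetical helper for illustration only.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return 0
    except OSError as exc:
        return exc.errno if exc.errno is not None else -1
```

Run from one node against another, a result equal to errno.ECONNREFUSED means the target actively rejected the connection (nothing listening on that port, or a firewall sending a reject), while a timeout (-1 here) points more towards silently dropped packets or an unreachable address. A name-resolution failure returns the resolver's own error code rather than a connect() errno.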

This sounds very much like an IP setup issue. Perhaps some nodes have
more than one interface, e.g. internal and external networks,
IP-over-FireWire, PPP devices or something else. If a node advertises
an address on an interface that the other nodes cannot reach, they
will be unable to connect to it. If so, use btl_tcp_if_exclude (or
btl_tcp_if_include) to specify the right interface.

The second possibility is a local firewall. Even though ssh
connections are allowed, the sysadmin could be blocking almost any
other (destination) port, which produces the same error message if the
firewall rejects with icmp-port-unreachable.

> What are some possible issues that might be causing this? What can I
> do to get more information?

I agree that you need more information. Can you rebuild with
--enable-debug and, in the file ompi/mca/btl/tcp/btl_tcp_endpoint.c,
change

    #define WANT_PEER_DUMP 0

from "0" to "1" before recompiling? This should give you detailed
information.

HTH

-- 
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany

private: http://adi.thur.de