On Thu, Sep 13, 2007 at 11:15:47AM -0500, Tim Campbell wrote:

> workstations.  When mpirun tries to start the processes on certain  
> nodes I get the following error output.
> 
> [sr70][0,1,2][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=111
> [sr71][0,1,3][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=111
> 
> Using perl -e 'die$!=111' I see that the error message is "Connection  
> refused".  I am able to connect to both nodes in question via ssh and/ 

This sounds very much like an IP setup issue. Perhaps some nodes have
more than one interface, e.g. internal and external networks,
IP-over-FireWire, PPP devices or something else. If the addresses of
such interfaces are exported, other nodes will try to connect to
addresses they cannot reach.

If so, use the MCA parameter btl_tcp_if_exclude (or btl_tcp_if_include)
to select the right interface.
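
A rough sketch of how one might pick the interface: the Python snippet
below lists the interface names on a node and assembles an mpirun
command line that restricts the TCP BTL to one of them. The interface
name eth0, the process count and the program name ./a.out are
placeholders, not taken from your setup, and build_mpirun_cmd is a
helper made up purely for illustration.

```python
import socket


def list_interfaces():
    """Return the names of the network interfaces known to the kernel."""
    return [name for _index, name in socket.if_nameindex()]


def build_mpirun_cmd(program, nprocs, include_if=None, exclude_if=None):
    """Assemble an mpirun argv that limits the TCP BTL to given interfaces.

    Hypothetical helper. Only one of include_if / exclude_if may be set,
    because the two MCA parameters are meant to be used one at a time.
    """
    if include_if and exclude_if:
        raise ValueError("use either include_if or exclude_if, not both")
    cmd = ["mpirun", "-np", str(nprocs)]
    if include_if:
        cmd += ["--mca", "btl_tcp_if_include", ",".join(include_if)]
    elif exclude_if:
        cmd += ["--mca", "btl_tcp_if_exclude", ",".join(exclude_if)]
    cmd.append(program)
    return cmd


if __name__ == "__main__":
    print("interfaces on this node:", list_interfaces())
    # "eth0" is an assumed name for the internal cluster interface.
    print(" ".join(build_mpirun_cmd("./a.out", 4, include_if=["eth0"])))
```

If you go the exclude route instead, remember to list the loopback
interface (lo) as well, since setting btl_tcp_if_exclude replaces its
default value.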

A second possibility: local firewalls. Although ssh connections might
be allowed, the sysadmin could be blocking almost any other destination
port. If the firewall rejects such connections with ICMP port
unreachable, you get the same "Connection refused" error.
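
To tell a rejecting firewall from a silently dropping one, a small TCP
probe can help. The following is only a sketch: as far as I know the
TCP BTL picks its ports dynamically, so the port you probe is your own
choice (start a test listener on the remote node first), and the
function name probe and its return strings are my own invention.

```python
import socket
import sys


def probe(host, port, timeout=3.0):
    """Attempt a TCP connect and classify the outcome as a short string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # errno 111 on Linux: nothing is listening, or a firewall
        # actively rejects the connection.
        return "refused"
    except socket.timeout:
        # No answer at all: packets are probably being dropped.
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. name resolution failure or no route to host.
        return "error: " + str(exc)


if __name__ == "__main__":
    # Usage: probe.py HOST PORT, e.g. "probe.py sr70 5000"
    # (5000 is an arbitrary example port).
    if len(sys.argv) == 3:
        print(probe(sys.argv[1], int(sys.argv[2])))
```

If this reports "refused" for a port where you know a listener is
running on the remote node, a firewall in between is the likely
culprit.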

> What are some possible issues that might be causing this?  What can I  
> do to get more information?

I agree that you need more information. Can you reconfigure with
--enable-debug and change 

#define WANT_PEER_DUMP 0

in file ompi/mca/btl/tcp/btl_tcp_endpoint.c from "0" to "1" before
recompiling?

This should give you detailed information.


HTH

-- 
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany

private: http://adi.thur.de
