Hi Claire,

 

The most probable reason for the observed behaviour is that there are 
additional active network interfaces on the nodes that cannot be used to pass 
data between them. Examples of such interfaces are virtual Ethernet devices 
(e.g. on systems with virtualisation enabled) and tunnels. Open MPI tries to 
maximise the network bandwidth by cycling over the available endpoints on each 
node, on the presumption that different IP addresses are routed over different 
physical networks and hence more bandwidth is available. That is why it fails 
with more than one message: the first message goes to the reachable IP address 
of the node, while the second one gets directed to an unreachable one.
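
As a rough illustration of that cycling, here is a toy Python model. It is not 
Open MPI's actual code, and the addresses are made up: 10.0.0.2 stands for the 
address on the real cluster network, 192.168.122.1 for a virtual interface that 
other nodes cannot reach.

```python
from itertools import cycle

# Hypothetical addresses for one remote node: the first is routable from
# the sender, the second belongs to a virtual interface that is not.
REACHABLE = {"10.0.0.2"}
ENDPOINTS = ["10.0.0.2", "192.168.122.1"]


def send_all(messages, endpoints, reachable):
    """Assign messages to endpoints round-robin and record each outcome."""
    results = []
    for msg, addr in zip(messages, cycle(endpoints)):
        status = "delivered" if addr in reachable else "No route to host"
        results.append((msg, addr, status))
    return results


for msg, addr, status in send_all(["msg 1", "msg 2"], ENDPOINTS, REACHABLE):
    print(f"{msg} -> {addr}: {status}")
```

With a single message only the first endpoint is ever used, which matches the 
case that works for you.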

 

The solution is either to tell Open MPI to ignore the offending interfaces or 
to state explicitly which interfaces are to be used by the TCP BTL and OOB 
components. This entry in the FAQ gives more details:

 

http://www.open-mpi.org/faq/?category=tcp#tcp-selection

 

The following options would probably remedy your problem:

 

--mca btl_tcp_if_exclude 192.168.0.0/16,127.0.0.1/8

--mca oob_tcp_if_exclude 192.168.0.0/16,127.0.0.1/8

 

Note that the loopback interface has to be part of the exclude list whenever 
such a list is provided, because an explicit list replaces the default one.
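
To see which addresses those two ranges cover, here is a small Python sketch 
using the standard "ipaddress" module. It only models ordinary CIDR matching, 
which I assume is how the ranges are applied; 10.0.0.2 is a made-up address 
standing for the real cluster network.

```python
import ipaddress

# The two ranges from the exclude lists. strict=False is needed because
# "127.0.0.1/8" has host bits set; it normalises to 127.0.0.0/8.
EXCLUDED = [
    ipaddress.ip_network(cidr, strict=False)
    for cidr in ("192.168.0.0/16", "127.0.0.1/8")
]


def is_excluded(addr):
    """Return True if addr falls inside any of the excluded ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in EXCLUDED)


# 10.0.0.2 is a hypothetical address on the network that should stay in use.
for addr in ("192.168.1.10", "127.0.0.1", "10.0.0.2"):
    print(addr, "excluded" if is_excluded(addr) else "kept")
```

If the network you actually want to use also lives in 192.168.0.0/16, this 
range is too broad and a narrower one is needed.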

 

The list of the active interfaces can be obtained with the "/sbin/ifconfig" 
command. Look for interfaces in state "UP".
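
If you prefer to filter the output automatically, a Python sketch along these 
lines could do it. The sample text is abbreviated and made up, in the older 
net-tools layout as printed by "ifconfig -a"; the parsing rule (a non-indented 
line starts an interface block, and the word UP marks it as active) is my 
assumption about the output format and may need adjusting on your systems.

```python
import re


def interfaces_up(ifconfig_text):
    """Return the names of interfaces whose block contains the UP flag."""
    up = []
    current = None
    for line in ifconfig_text.splitlines():
        if line and not line[0].isspace():
            # A non-indented line starts a new interface block.
            current = line.split()[0].rstrip(":")
        if current and re.search(r"\bUP\b", line) and current not in up:
            up.append(current)
    return up


# Abbreviated, made-up sample output; eth1 is configured but not UP.
SAMPLE = """\
eth0      Link encap:Ethernet  HWaddr 00:00:00:00:00:01
          inet addr:10.0.0.2  Bcast:10.0.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

virbr0    Link encap:Ethernet  HWaddr 00:00:00:00:00:02
          inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
          UP BROADCAST MULTICAST  MTU:1500  Metric:1

eth1      Link encap:Ethernet  HWaddr 00:00:00:00:00:03
          BROADCAST MULTICAST  MTU:1500  Metric:1
"""

print(interfaces_up(SAMPLE))
```

To run it on a node, the text could be obtained with 
subprocess.run(["/sbin/ifconfig", "-a"], capture_output=True, text=True).stdout 
and passed to interfaces_up().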

 

--

Hristo Iliev, PhD – High Performance Computing Team

RWTH Aachen University, Center for Computing and Communication

Rechen- und Kommunikationszentrum der RWTH Aachen

Seffenter Weg 23, D 52074 Aachen (Germany)

Phone: +49 241 80 24367 – Fax/UMS: +49 241 80 624367

 

 

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Claire Williams
Sent: Tuesday, June 18, 2013 7:15 PM
To: us...@open-mpi.org
Subject: [OMPI users] Trouble with Sending Multiple messages to the Same Machine

 

Hi guys ☺!

 

I'm working with a simple "Hello, World" MPI program in which one master sends 
one message to each worker, receives a message back from each of the workers, 
and then sends a new message. This unfortunately is not working :(. When the 
master sends only one message to each worker and then receives the reply, it 
works fine, but there are problems with sending more than one message to each 
worker. When that happens, it prints the error:

 

[[401,1],0][../../../../../openmpi-1.6.3/ompi/mca/btl/tcp/btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
 connect() to 192.168.X.X failed: No route to host (113)

 

I'm wondering how I can go about fixing this. This program is running across 
multiple Linux nodes, by the way :). 

 

BTW, I'm a girl.

 

 

 
