Nicolas Niclausse wrote:
Fernando Lemos wrote on 23/03/2010 16:28:
I'm trying to run Open MPI (1.4.1) across two clusters. On each cluster,
several interfaces are private.
On cluster1, nodes have 3 interfaces, and only 192.168.159.0/24 is visible
from cluster2.
chicon-3:
  eth0   inet addr:192.168.160.76  Bcast:192.168.160.255  Mask:255.255.255.0
  eth1   inet addr:192.168.159.76  Bcast:192.168.159.255  Mask:255.255.255.0
  myri0  inet addr:192.168.162.76  Bcast:192.168.162.255  Mask:255.255.255.0
On cluster2, nodes have 3 interfaces, and only 172.24.110.0/17 is visible
from cluster1.
netgdx-8:
  eth0   inet addr:172.24.190.8  Bcast:172.24.191.255  Mask:255.255.192.0
  eth1   inet addr:172.24.110.8  Bcast:172.24.127.255  Mask:255.255.128.0
  eth2   inet addr:172.24.240.8  Bcast:172.24.255.255  Mask:255.255.192.0
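As a cross-check on the subnets quoted here, the network each interface belongs to can be derived from its address and netmask. The following is a minimal Python sketch using only the standard `ipaddress` module. The `interfaces` dictionary simply restates the ifconfig lines above, and the helper name `network_of` is made up for illustration.

```python
import ipaddress

# Addresses and masks copied from the ifconfig output above.
interfaces = {
    "chicon-3": {
        "eth0": ("192.168.160.76", "255.255.255.0"),
        "eth1": ("192.168.159.76", "255.255.255.0"),
        "myri0": ("192.168.162.76", "255.255.255.0"),
    },
    "netgdx-8": {
        "eth0": ("172.24.190.8", "255.255.192.0"),
        "eth1": ("172.24.110.8", "255.255.128.0"),
        "eth2": ("172.24.240.8", "255.255.192.0"),
    },
}


def network_of(addr, mask):
    """Return the CIDR network that contains addr under the given netmask."""
    return ipaddress.ip_interface(f"{addr}/{mask}").network


for host, nics in interfaces.items():
    for name, (addr, mask) in nics.items():
        print(f"{host} {name}: {addr} -> {network_of(addr, mask)}")
```

Under this arithmetic the /17 network containing 172.24.110.8 is written canonically as 172.24.0.0/17, and the other two interfaces on netgdx-8 fall in 172.24.128.0/18 and 172.24.192.0/18.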
So I'm using this to declare all the other networks as private:
  mpirun -machinefile ~/gridnodes --mca opal_net_private_ipv4 \
      "192.168.162.0/24\;192.168.160.0/24\;172.24.192.0/18\;172.24.128.0/18" \
      ./alltoall
but this doesn't work:
Have you tried -mca btl_tcp_if_include/exclude?
I can't do that because the "public" interface is not always eth1 as in
this example (I have several other clusters with different network
configurations in my setup).
Why does Open MPI try to connect different private networks, given that
"public" networks exist? Is it a bug, or am I missing something?
From what I've seen, I believe OpenMPI tries to find the fastest route
to the nodes. In some cases it's trivial to sort that out, in other
cases you might need to give it some hints.
Yes, so I thought that "opal_net_private_ipv4" was the right thing for me,
but it doesn't work without the patch.
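To make the intent of that private list concrete, here is a small Python sketch that checks each example address against the four prefixes passed to opal_net_private_ipv4 in the mpirun command. It only mirrors plain CIDR matching with the standard `ipaddress` module. It says nothing about how Open MPI itself parses or applies the parameter, and the names `parse_private` and `is_private` are hypothetical.

```python
import ipaddress

# The list given to opal_net_private_ipv4 in the mpirun command
# (shell escaping of the semicolons removed).
PRIVATE = "192.168.162.0/24;192.168.160.0/24;172.24.192.0/18;172.24.128.0/18"

# One address per interface, taken from the two example nodes.
ADDRESSES = {
    "chicon-3 eth0": "192.168.160.76",
    "chicon-3 eth1": "192.168.159.76",
    "chicon-3 myri0": "192.168.162.76",
    "netgdx-8 eth0": "172.24.190.8",
    "netgdx-8 eth1": "172.24.110.8",
    "netgdx-8 eth2": "172.24.240.8",
}


def parse_private(spec):
    """Split a semicolon-separated CIDR list into network objects."""
    return [ipaddress.ip_network(item) for item in spec.split(";") if item]


def is_private(addr, private_nets):
    """True if addr falls inside any of the listed networks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in private_nets)


nets = parse_private(PRIVATE)
for label, addr in ADDRESSES.items():
    kind = "private" if is_private(addr, nets) else "public"
    print(f"{label}: {addr} is {kind}")
```

With this list, only the two eth1 addresses (192.168.159.76 and 172.24.110.8) fall outside every listed prefix, which is the outcome the list is meant to produce.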
It seems to me that you are hitting a code path where at least one of the
interfaces is considered private. When comparing a public and a private
interface, the code assigns a weighting of
CQ_PRIVATE_DIFFERENT_NETWORK. I am not sure why, but that is the
weighting it gives. You can take a look at this FAQ entry,
http://www.open-mpi.org/faq/?category=tcp#tcp-routability-1.3, which
links to the paper that explains how all this logic works.
It seems that what you are doing makes sense. You are trying to define
which networks are private so that, in the end, the two other networks
are treated as public and therefore get the highest weight for a
connection.
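A rough way to picture that expectation is sketched below in Python. This is not the Open MPI code: only the name CQ_PRIVATE_DIFFERENT_NETWORK comes from this thread. The other enum members, their ordering, and the function `weigh` are assumptions made purely for illustration.

```python
from enum import IntEnum
import ipaddress


class Quality(IntEnum):
    # Only CQ_PRIVATE_DIFFERENT_NETWORK is named in this thread; the other
    # members and their relative order are assumptions for illustration.
    CQ_NO_CONNECTION = 0
    CQ_PRIVATE_DIFFERENT_NETWORK = 1
    CQ_PRIVATE_SAME_NETWORK = 2
    CQ_PUBLIC_DIFFERENT_NETWORK = 3
    CQ_PUBLIC_SAME_NETWORK = 4


def weigh(local, remote, private_nets):
    """Weigh a pair of interfaces (ipaddress interface objects)."""
    def private(iface):
        return any(iface.ip in net for net in private_nets)

    same = local.network == remote.network
    if private(local) or private(remote):
        # At least one side is private, as in the case described above.
        if same:
            return Quality.CQ_PRIVATE_SAME_NETWORK
        return Quality.CQ_PRIVATE_DIFFERENT_NETWORK
    if same:
        return Quality.CQ_PUBLIC_SAME_NETWORK
    return Quality.CQ_PUBLIC_DIFFERENT_NETWORK


private_nets = [
    ipaddress.ip_network(n)
    for n in ("192.168.162.0/24", "192.168.160.0/24",
              "172.24.192.0/18", "172.24.128.0/18")
]
chicon_eth0 = ipaddress.ip_interface("192.168.160.76/24")
chicon_eth1 = ipaddress.ip_interface("192.168.159.76/24")
netgdx_eth1 = ipaddress.ip_interface("172.24.110.8/17")

print(weigh(chicon_eth1, netgdx_eth1, private_nets).name)
print(weigh(chicon_eth0, netgdx_eth1, private_nets).name)
```

In this toy model the eth1-to-eth1 pair gets a higher weight than any pair involving a private interface, which is the behaviour the private list is meant to achieve.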
I realize this does not help much, but maybe the paper will help out.
Rolf