Nicolas Niclausse wrote:
Fernando Lemos wrote on 23/03/2010 16:28:
I'm trying to run Open MPI (1.4.1) across two clusters; on each cluster,
several interfaces are private.

on cluster1, nodes have 3 interfaces, and only 192.168.159.0/24 is visible
from cluster2.

chicon-3
eth0     inet addr:192.168.160.76  Bcast:192.168.160.255  Mask:255.255.255.0
eth1     inet addr:192.168.159.76  Bcast:192.168.159.255  Mask:255.255.255.0
myri0    inet addr:192.168.162.76  Bcast:192.168.162.255  Mask:255.255.255.0

on cluster2, nodes have 3 interfaces, and only 172.24.110.0/17 is visible
from cluster1

netgdx-8
eth0  inet addr:172.24.190.8  Bcast:172.24.191.255  Mask:255.255.192.0
eth1  inet addr:172.24.110.8  Bcast:172.24.127.255  Mask:255.255.128.0
eth2  inet addr:172.24.240.8  Bcast:172.24.255.255  Mask:255.255.192.0

So I'm using this to declare all the other networks as private:

mpirun -machinefile ~/gridnodes --mca opal_net_private_ipv4 \
  "192.168.162.0/24\;192.168.160.0/24\;172.24.192.0/18\;172.24.128.0/18" \
  ./alltoall
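
As a sanity check on the list passed to opal_net_private_ipv4, here is a minimal Python sketch that uses only the standard ipaddress module. It computes the network of each interface from the ifconfig output above and reports whether that address falls inside one of the declared private networks. The names DECLARED_PRIVATE, INTERFACES, classify and classify_all are made up for this illustration, and the check only approximates the intent of the parameter; it is not Open MPI's own logic.

```python
import ipaddress

# Private networks exactly as listed in the mpirun command above.
DECLARED_PRIVATE = [
    ipaddress.ip_network(n)
    for n in (
        "192.168.162.0/24",
        "192.168.160.0/24",
        "172.24.192.0/18",
        "172.24.128.0/18",
    )
]

# Addresses and masks copied from the ifconfig output above.
INTERFACES = {
    "chicon-3": {
        "eth0": "192.168.160.76/255.255.255.0",
        "eth1": "192.168.159.76/255.255.255.0",
        "myri0": "192.168.162.76/255.255.255.0",
    },
    "netgdx-8": {
        "eth0": "172.24.190.8/255.255.192.0",
        "eth1": "172.24.110.8/255.255.128.0",
        "eth2": "172.24.240.8/255.255.192.0",
    },
}


def classify(addr_with_mask):
    """Return (network, 'private' or 'public') for one interface address."""
    iface = ipaddress.ip_interface(addr_with_mask)
    is_private = any(iface.ip in net for net in DECLARED_PRIVATE)
    return str(iface.network), ("private" if is_private else "public")


def classify_all():
    """Classify every interface of every host in INTERFACES."""
    return {
        (host, name): classify(addr)
        for host, ifaces in INTERFACES.items()
        for name, addr in ifaces.items()
    }


if __name__ == "__main__":
    for (host, name), (network, label) in classify_all().items():
        print(f"{host:10s} {name:6s} {network:18s} {label}")
```

With the list above, only eth1 on each node stays outside the declared private networks, which matches the intent. Note that eth1 on netgdx-8 (172.24.110.8 with mask 255.255.128.0) sits in the network 172.24.0.0/17.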

but this doesn't work:

Have you tried -mca btl_tcp_if_include/exclude?

I can't do that because the "public" interface is not always eth1 as in
this example (I have several other clusters with different network
configurations in my setup).

Why does Open MPI try to connect different private networks, given that
"public" networks exist? Is it a bug, or am I missing something?
From what I've seen, I believe Open MPI tries to find the fastest route
to the nodes. In some cases it's trivial to sort that out, in other
cases you might need to give it some hints.

Yes, so I thought that "opal_net_private_ipv4" was the right thing for me,
but it doesn't work without the patch.

It seems to me that you are hitting a code path where at least one of the two interfaces is considered private. When a public and a private interface are compared, the pair is given a weighting of CQ_PRIVATE_DIFFERENT_NETWORK. I am not sure why, but that is the weighting it gives. You can take a look at this FAQ, http://www.open-mpi.org/faq/?category=tcp#tcp-routability-1.3, which links to the paper that explains how all this logic works.

It seems that what you are doing makes sense. You are declaring which networks are private, so you expect the two remaining networks to be treated as public and therefore to get the highest weight for a connection.
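
To make the weighting described above concrete, here is a rough Python sketch. It is not Open MPI's actual code: only the name CQ_PRIVATE_DIFFERENT_NETWORK comes from this thread. The other enum members, their ordering, the weigh function, and the "same network" test (comparing the networks derived from each interface's own mask) are assumptions made for illustration.

```python
import ipaddress
from enum import IntEnum


class ConnectionQuality(IntEnum):
    # Only CQ_PRIVATE_DIFFERENT_NETWORK is named in this thread. The other
    # members and the ordering (higher is better) are assumptions.
    CQ_NO_CONNECTION = 0
    CQ_PRIVATE_DIFFERENT_NETWORK = 1
    CQ_PRIVATE_SAME_NETWORK = 2
    CQ_PUBLIC_DIFFERENT_NETWORK = 3
    CQ_PUBLIC_SAME_NETWORK = 4


def weigh(local, peer, private_nets):
    """Weight a (local, peer) interface pair; hypothetical helper.

    local and peer are "address/mask" strings; private_nets is a list of
    CIDR strings that should be treated as private.
    """
    local_if = ipaddress.ip_interface(local)
    peer_if = ipaddress.ip_interface(peer)
    nets = [ipaddress.ip_network(n) for n in private_nets]

    def is_public(iface):
        return not any(iface.ip in net for net in nets)

    same_network = local_if.network == peer_if.network

    if is_public(local_if) and is_public(peer_if):
        if same_network:
            return ConnectionQuality.CQ_PUBLIC_SAME_NETWORK
        return ConnectionQuality.CQ_PUBLIC_DIFFERENT_NETWORK
    # At least one side is private, as in the case described above.
    if same_network:
        return ConnectionQuality.CQ_PRIVATE_SAME_NETWORK
    return ConnectionQuality.CQ_PRIVATE_DIFFERENT_NETWORK


# Private list from the mpirun command in this thread.
DECLARED = ["192.168.162.0/24", "192.168.160.0/24",
            "172.24.192.0/18", "172.24.128.0/18"]
# RFC 1918 ranges, for comparison.
RFC1918 = ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"]

if __name__ == "__main__":
    chicon_eth1 = "192.168.159.76/255.255.255.0"
    netgdx_eth1 = "172.24.110.8/255.255.128.0"
    print(weigh(chicon_eth1, netgdx_eth1, DECLARED).name)
    print(weigh(chicon_eth1, netgdx_eth1, RFC1918).name)
```

Under these assumptions, the eth1/eth1 pair is weighted as public only when the private list is the one from the mpirun command. If the RFC 1918 ranges are treated as private instead, both addresses count as private and the same pair gets CQ_PRIVATE_DIFFERENT_NETWORK.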

I realize this does not help much, but maybe the paper will help out.

Rolf

