On Feb 24, 2006, at 8:23 AM, Emanuel Ziegler wrote:
>> So, the question from the mpirun_debug.out file is: what IP addresses
>> do node01 and node02 have? Is the local 10.0.0.1 node01, while
>> 10.1.0.1 is node02? Maybe the route from node01 to node02 is not
>> correct?
>
> OK, I figured out the problem, but didn't solve it completely.
>
> node01 and node02 both have multiple IP addresses:
>
>   node01 has 10.0.0.1 for TCP (eth1) and 10.1.0.1 for IPoIB (ib0).
>   node02 has 10.0.0.2 for TCP (eth1) and 10.1.0.2 for IPoIB (ib0).
>
> The latter addresses are useless, but don't affect the problem. I
> chose eth1 on both machines because eth0 is only 10/100 Mbit and I
> wanted to have Gbit connections to the file server in the internal
> network. The problem was that I set up eth0 on node01 (the golden
> client) using DHCP on the external network for setup purposes. Hence,
> it also had an external address (129.206.102.93) which was
> inaccessible from node02.
>
> Since orterun was started with the parameters
>
>   --nsreplica "0.0.0;tcp://129.206.102.93:54866;tcp://10.0.0.1:54866"
>   --gprreplica "0.0.0;tcp://129.206.102.93:54866;tcp://10.0.0.1:54866"
>
> node02 first tried to communicate with 129.206.102.93, which was
> impossible, and hung, although it would have been able to reach
> 10.0.0.1 without any problems. But obviously it never got to that
> point.
>
> Although disabling eth0 with "ifdown eth0" solves the problem, this
> is not applicable to my cluster, since this was just a test setup and
> I need the external address for my head node.
>
> Can I configure orterun/orted to use only eth1?
Yes, start mpirun with the arguments "-mca oob_tcp_include eth1 -mca
btl_tcp_if_include eth1" and it should work properly. The parameters
can also be set in either the global or per-user configuration file
for Open MPI (once you have tested it, of course). See our FAQ item
on this:
http://www.open-mpi.org/faq/?category=tuning#setting-mca-params
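
As a sketch of the config-file form (assuming the per-user file at
~/.openmpi/mca-params.conf, as described in the FAQ item above), the
two parameters would look like:

```ini
# ~/.openmpi/mca-params.conf -- per-user Open MPI MCA parameters
# Restrict the out-of-band (runtime) TCP channel to eth1
oob_tcp_include = eth1
# Restrict the TCP point-to-point transport (BTL) to eth1
btl_tcp_if_include = eth1
```

With those lines in place, a plain "mpirun" picks them up without the
"-mca" arguments on every command line.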
The second argument is there because you'll probably run into the
exact same problem when the TCP transport tries to start up (although
it sounds like you're going to be using native IB for communication,
it never hurts to make sure TCP has a chance of working).
Brian
--
Brian Barrett
Open MPI developer
http://www.open-mpi.org/