> So, the question from the mpirun_debug.out-file is, what IP-addresses do
> node01 and node02 have, is the local 10.0.0.1 node01, while 10.1.0.1 is
> node02?
> Maybe the route on node01 is not correct to node02?

OK, I figured out the problem, but haven't solved it completely.

node01 and node02 both have multiple IP addresses. node01 has 10.0.0.1 for TCP (eth1) and 10.1.0.1 for IPoIB (ib0). node02 has 10.0.0.2 for TCP (eth1) and 10.1.0.2 for IPoIB (ib0). The latter (IPoIB) addresses are useless here, but they don't affect the problem. I chose eth1 on both machines because eth0 is only 10/100 Mbit and I wanted Gbit connections to the file server in the internal network.

The problem was that I had set up eth0 on node01 (the golden client) using DHCP on the external network for setup purposes. Hence, it also had an external address (129.206.102.93), which was inaccessible from node02. orterun was started with the parameters

    --nsreplica "0.0.0;tcp://129.206.102.93:54866;tcp://10.0.0.1:54866"
    --gprreplica "0.0.0;tcp://129.206.102.93:54866;tcp://10.0.0.1:54866"

so node02 first tried to communicate with 129.206.102.93, which was impossible, and hung there, although it would have been able to reach 10.0.0.1 without any problems. Obviously it never got to that point.

Disabling eth0 with "ifdown eth0" solves the problem, but that is not an option for my cluster: this was just a test setup, and I need the external address for my head node.

Can I configure orterun/orted to use only eth1?

Thanks for your help,
Emanuel
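
For anyone who wants to check the same thing by hand, here is a minimal Python sketch that splits the replica string shown above into its contact addresses and probes them in the listed order. It is not part of Open MPI and says nothing about how orted connects internally. The function names (parse_replica, tcp_probe, first_reachable) and the 3-second timeout are made up for illustration.

```python
import socket
from urllib.parse import urlsplit

# Replica string copied from the orterun parameters quoted above.
REPLICA = "0.0.0;tcp://129.206.102.93:54866;tcp://10.0.0.1:54866"


def parse_replica(value):
    """Split a replica string into (process name, [(host, port), ...])."""
    name, *uris = value.split(";")
    contacts = []
    for uri in uris:
        parts = urlsplit(uri)
        contacts.append((parts.hostname, parts.port))
    return name, contacts


def tcp_probe(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks and timeouts.
        return False


def first_reachable(contacts, probe=tcp_probe):
    """Try the contacts in the order given; return the first that answers."""
    for host, port in contacts:
        if probe(host, port):
            return host, port
    return None


# Parsing only; no network traffic happens here.
name, contacts = parse_replica(REPLICA)
print(name, contacts)
```

Calling first_reachable(contacts) on node02 while orterun is running on node01 would be expected to skip the external address after the timeout and report 10.0.0.1, assuming the internal network is up; the probe argument can be swapped for a stub to try the logic without a network.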