Hi,

I want to use Open MPI across two machines, each of which has more than one NIC:

wukong: eth0 (152.48.249.102, no MPI traffic), eth1 (128.109.34.20, yes MPI traffic)
zelda01: eth0 (130.207.252.131, yes MPI traffic), eth2 (10.0.0.12, no MPI traffic)

On wukong, I have:

[humphrey@wukong ~]$ more ~/.openmpi/mca-params.conf
btl_tcp_if_include=eth1

On zelda01, I have:

[humphrey@zelda01 humphrey]$ more ~/.openmpi/mca-params.conf
btl_tcp_if_include=eth0

Below is what I get when I attempt to run it from wukong (128.109.34.20). It hangs at the end of this output; I believe the remote machine (zelda01) is trying to contact wukong on the non-accessible interface (152.48.249.102).

This is based on openmpi-1.0rc5r7944. What am I doing wrong?

Thanks,
Marty

Marty Humphrey
Assistant Professor
Department of Computer Science
University of Virginia

[humphrey@wukong ~]$ mpirun -d --mca btl tcp --host 128.109.34.20,130.207.252.131 -np 2 a.out
[wukong.ncren.net:17236] [0,0,0] setting up session dir with
[wukong.ncren.net:17236] universe default-universe
[wukong.ncren.net:17236] user humphrey
[wukong.ncren.net:17236] host wukong.ncren.net
[wukong.ncren.net:17236] jobid 0
[wukong.ncren.net:17236] procid 0
[wukong.ncren.net:17236] procdir: /tmp/openmpi-sessions-humphrey@wukong.ncren.net_0/default-universe/0/0
[wukong.ncren.net:17236] jobdir: /tmp/openmpi-sessions-humphrey@wukong.ncren.net_0/default-universe/0
[wukong.ncren.net:17236] unidir: /tmp/openmpi-sessions-humphrey@wukong.ncren.net_0/default-universe
[wukong.ncren.net:17236] top: openmpi-sessions-humphrey@wukong.ncren.net_0
[wukong.ncren.net:17236] tmp: /tmp
[wukong.ncren.net:17236] [0,0,0] contact_file /tmp/openmpi-sessions-humphrey@wukong.ncren.net_0/default-universe/universe-setup.txt
[wukong.ncren.net:17236] [0,0,0] wrote setup file
[wukong.ncren.net:17236] pls:rsh: local csh: 0, local bash: 1
[wukong.ncren.net:17236] pls:rsh: assuming same remote shell as local shell
[wukong.ncren.net:17236] pls:rsh: remote csh: 0, remote bash: 1
[wukong.ncren.net:17236] pls:rsh: final template argv:
[wukong.ncren.net:17236] pls:rsh: ssh <template> orted --debug --bootproxy 1 --name <template> --num_procs 3 --vpid_start 0 --nodename <template> --universe humph...@wukong.ncren.net:default-universe --nsreplica "0.0.0;tcp://152.48.249.102:33964;tcp://128.109.34.20:33964" --gprreplica "0.0.0;tcp://152.48.249.102:33964;tcp://128.109.34.20:33964" --mpi-call-yield 0
[wukong.ncren.net:17236] pls:rsh: launching on node 128.109.34.20
[wukong.ncren.net:17236] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[wukong.ncren.net:17236] pls:rsh: 128.109.34.20 is a LOCAL node
[wukong.ncren.net:17236] pls:rsh: executing: orted --debug --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0 --nodename 128.109.34.20 --universe humph...@wukong.ncren.net:default-universe --nsreplica "0.0.0;tcp://152.48.249.102:33964;tcp://128.109.34.20:33964" --gprreplica "0.0.0;tcp://152.48.249.102:33964;tcp://128.109.34.20:33964" --mpi-call-yield 0
[wukong.ncren.net:17237] [0,0,1] setting up session dir with
[wukong.ncren.net:17237] universe default-universe
[wukong.ncren.net:17237] user humphrey
[wukong.ncren.net:17237] host 128.109.34.20
[wukong.ncren.net:17237] jobid 0
[wukong.ncren.net:17237] procid 1
[wukong.ncren.net:17237] procdir: /tmp/openmpi-sessions-humphrey@128.109.34.20_0/default-universe/0/1
[wukong.ncren.net:17237] jobdir: /tmp/openmpi-sessions-humphrey@128.109.34.20_0/default-universe/0
[wukong.ncren.net:17237] unidir: /tmp/openmpi-sessions-humphrey@128.109.34.20_0/default-universe
[wukong.ncren.net:17237] top: openmpi-sessions-humphrey@128.109.34.20_0
[wukong.ncren.net:17237] tmp: /tmp
[wukong.ncren.net:17236] pls:rsh: launching on node 130.207.252.131
[wukong.ncren.net:17236] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[wukong.ncren.net:17236] pls:rsh: 130.207.252.131 is a REMOTE node
[wukong.ncren.net:17236] pls:rsh: executing: ssh 130.207.252.131 orted --debug --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --nodename 130.207.252.131 --universe humph...@wukong.ncren.net:default-universe --nsreplica "0.0.0;tcp://152.48.249.102:33964;tcp://128.109.34.20:33964" --gprreplica "0.0.0;tcp://152.48.249.102:33964;tcp://128.109.34.20:33964" --mpi-call-yield 0
[zelda01.localdomain:08631] [0,0,2] setting up session dir with
[zelda01.localdomain:08631] universe default-universe
[zelda01.localdomain:08631] user humphrey
[zelda01.localdomain:08631] host 130.207.252.131
[zelda01.localdomain:08631] jobid 0
[zelda01.localdomain:08631] procid 2
[zelda01.localdomain:08631] procdir: /tmp/openmpi-sessions-humphrey@130.207.252.131_0/default-universe/0/2
[zelda01.localdomain:08631] jobdir: /tmp/openmpi-sessions-humphrey@130.207.252.131_0/default-universe/0
[zelda01.localdomain:08631] unidir: /tmp/openmpi-sessions-humphrey@130.207.252.131_0/default-universe
[zelda01.localdomain:08631] top: openmpi-sessions-humphrey@130.207.252.131_0
[zelda01.localdomain:08631] tmp: /tmp
[wukong.ncren.net:17239] [0,1,0] setting up session dir with
[wukong.ncren.net:17239] universe default-universe
[wukong.ncren.net:17239] user humphrey
[wukong.ncren.net:17239] host 128.109.34.20
[wukong.ncren.net:17239] jobid 1
[wukong.ncren.net:17239] procid 0
[wukong.ncren.net:17239] procdir: /tmp/openmpi-sessions-humphrey@128.109.34.20_0/default-universe/1/0
[wukong.ncren.net:17239] jobdir: /tmp/openmpi-sessions-humphrey@128.109.34.20_0/default-universe/1
[wukong.ncren.net:17239] unidir: /tmp/openmpi-sessions-humphrey@128.109.34.20_0/default-universe
[wukong.ncren.net:17239] top: openmpi-sessions-humphrey@128.109.34.20_0
[wukong.ncren.net:17239] tmp: /tmp
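
To make the suspicion about the 152.48.249.102 interface easier to check, the addresses handed to the remote orted can be pulled out of the --nsreplica value in the output above. The following is only a rough Python sketch: the helper name replica_addresses is hypothetical (not part of Open MPI), and the "name;tcp://ip:port;..." layout is inferred from the log, not taken from Open MPI documentation.

```python
def replica_addresses(replica: str) -> list[tuple[str, int]]:
    """Split an --nsreplica/--gprreplica value into (host, port) pairs.

    Assumed layout, inferred from the log: "<name>;tcp://<ip>:<port>;..."
    """
    _name, *uris = replica.split(";")
    pairs = []
    for uri in uris:
        scheme, _, hostport = uri.partition("://")
        if scheme != "tcp":
            continue  # ignore anything that is not a tcp:// entry
        host, _, port = hostport.rpartition(":")
        pairs.append((host, int(port)))
    return pairs


# Value copied from the mpirun -d output in this post.
NSREPLICA = "0.0.0;tcp://152.48.249.102:33964;tcp://128.109.34.20:33964"
# eth1 on wukong, the interface named in btl_tcp_if_include.
BTL_INTERFACE_ADDR = "128.109.34.20"

for host, port in replica_addresses(NSREPLICA):
    if host == BTL_INTERFACE_ADDR:
        note = "eth1, selected by btl_tcp_if_include"
    else:
        note = "eth0, not meant for MPI traffic"
    print(f"{host}:{port}  ({note})")
```

Run against the value above, this lists both of wukong's addresses, with the eth0 address first. That suggests btl_tcp_if_include is not what filters the contact information passed on the orted command line, though the sketch itself cannot confirm which address zelda01 actually tries.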