Hi! I finally installed OpenMPI 1.0.2-a7 with libibverbs-1.0-rc5 and libmthca-1.0-rc5 on Debian sarge with kernel 2.6.15 (from www.backports.org) in order to use InfiniBand.
While InfiniBand itself seems to be working (ping over IPoIB works perfectly),
the mpirun/orterun command hangs as soon as it has to start processes on the
second node, with rsh as well as with ssh.
The file /usr/local/etc/openmpi-default-hostfile contains:
node01 slots=2
node02 slots=2
Both hosts are completely identical (apart from network config) and the
problem is symmetric: it looks the same when mpirun is started on node02.
Although I can execute commands (all on node01) like
$ mpirun -np 1 hostname
node01
and
$ rsh node02 hostname
node02
the command
$ mpirun -np 4 hostname
node01
node01
hangs after printing only the two local lines. Pressing Ctrl+C stops it,
but gives no hint about the cause of the problem.
The output of
$ mpirun --debug -np 4 hostname
is attached (mpirun_debug.out.gz). The important line seems to be
[node02:12018] [0,0,2]-[0,0,0] mca_oob_tcp_peer_complete_connect: connect() failed with errno=113
Unfortunately, I don't know what errno=113 means, but obviously it's a
TCP problem.
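
A numeric errno can be decoded by asking the system's C library for its symbolic name and message. The sketch below is not part of Open MPI; `describe_errno` is a hypothetical helper name, and the mapping from numbers to names is platform-specific. On typical Linux systems 113 corresponds to EHOSTUNREACH ("No route to host"), which would point at routing or packet filtering between the nodes; treat that reading as an assumption until it is checked on the affected machines.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic_name, message) for a numeric errno on this platform.

    Hypothetical helper for illustration only. The result depends on the
    operating system the script runs on, so run it on the node that
    reported the error.
    """
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# Decode the value reported by mca_oob_tcp_peer_complete_connect.
print(describe_errno(113))
```

Running this on node02 shows how that system names the error from the debug output.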
It doesn't seem to matter if orted runs or not. No processes are
launched on the remote host.
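
Because the failing call is a plain TCP connect() from node02 back to node01, the same path can be tested without MPI. The following is a minimal sketch under stated assumptions: `try_connect` is a hypothetical helper, and the port has to be one that is known to be listening on the target host (the ports Open MPI's out-of-band channel uses are not shown in this report). If ping over one interface works but this check fails over another, the host name may be resolving to an address that is not reachable from the other node.

```python
import socket


def try_connect(host, port, timeout=3.0):
    """Try a TCP connection and report the result.

    Returns (True, 0) on success, or (False, errno) on failure.
    Hypothetical helper for illustration; not an Open MPI API.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, 0
    except OSError as exc:
        # exc.errno may be None (for example on a timeout), so use -1 then.
        return False, exc.errno if exc.errno is not None else -1


# Example call, to be run on node02 (host and port are assumptions):
# print(try_connect("node01", 22))
```

A failure with the same errno as in the mpirun debug output would suggest the problem lies in the network configuration and not in the launcher.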
Thanks,
Emanuel
Attachments (gzip-compressed):
config.log.gz
mpirun_debug.out.gz
ompi_info.out.gz
