Hi Everyone,

I'm having a very basic problem getting an MPI job to run on multiple nodes. My setup consists of two identically configured nodes, called node01 and node02, connected via Ethernet and InfiniBand. They are running CentOS 5.2 and the bundled Open MPI (OMPI), version 1.2.5. I've attached the output of "ompi_info --all", which is the same for both nodes.

The problem is that if I run any of the following (on node01), mpirun simply hangs:

mpirun -np 2 -host node01,node02 uname
mpirun -host node02 uname
mpirun -host node02 -mca btl tcp,self uname
mpirun -host node02 -mca btl tcp,self,^openib uname

Before falling back to "uname" as a test, I had been trying a simple MPI program, with the same result. At this point, to keep things simple, I'm not too worried about getting InfiniBand working. I even went so far as to unload the InfiniBand kernel modules (via "/etc/init.d/openibd stop" on both nodes) to make sure OMPI was using Ethernet only.

As a sanity check, each of the following works fine:

node01:~ % mpirun uname
Linux
node01:~ % mpirun -np 2 uname
Linux
Linux
node01:~ % ssh node02 uname
Linux
node01:~ % ssh node02 mpirun -np 2 uname
Linux
Linux
node01:~ % ssh node02 echo \$PATH
/usr/lib64/openmpi/1.2.5-gcc/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lib64/openmpi/1.2.5-gcc/bin:/home/rbabich/bin:.
node01:~ % ssh node02 echo \$LD_LIBRARY_PATH
/usr/lib64/openmpi/1.2.5-gcc/lib:/usr/local/cuda/lib

Both $PATH and $LD_LIBRARY_PATH seem to be set correctly. There is no firewall running on either of the nodes, and everything I've said holds true if I reverse the roles of node01 and node02. In particular, I can ssh both ways. The local network is specified with a simple /etc/hosts:

127.0.0.1       localhost.localdomain   localhost
::1             localhost6.localdomain6 localhost6

192.168.0.1     frontend
192.168.0.101   node01
192.168.0.102   node02

When I try any of the above mpirun commands, orted on node02 seems to start successfully, but nothing happens. For example, if I run the following on node01:

node01:~ % mpirun -host node02 uname

it hangs, and on node02 I get:

node02:~ % ps aux | grep orted
rbabich 7741 0.0 0.0 75656 1868 ? Ss 14:53 0:00 /usr/lib64/openmpi/1.2.5-gcc/bin/orted --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename node02 --universe rbabich@node01:default-universe-8105 --nsreplica 0.0.0;tcp://192.168.0.101:52342 --gprreplica 0.0.0;tcp://192.168.0.101:52342
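
The orted command line above shows it being told to contact mpirun at tcp://192.168.0.101:52342, so one check not covered by the ssh tests is whether node02 can open a TCP connection back to node01 on a given port. A minimal sketch of such a check, assuming Python 3 is available on the nodes; the helper name can_connect is hypothetical, and the port has to be read from the orted arguments of the current run, since mpirun appears to pick a different one each time:

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers connection refused, unreachable host, DNS failure and timeout
        return False


# Example call from node02 while mpirun is hanging on node01
# (address and port copied from the orted arguments of that run):
#   print(can_connect("192.168.0.101", 52342))
```

A result of False while mpirun is still waiting would suggest that the connection back from node02 to node01 is the step that fails, rather than the ssh launch of orted itself.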

Any ideas?

Thanks,
Ron

Attachment: ompi_info-all.gz
Description: GNU Zip compressed data
