James --
Sorry for the delay in replying.
Do you have any firewall software running on your nodes (e.g.,
iptables)? OMPI uses random TCP ports to connect between nodes for
control messages. If they can't reach each other because TCP ports
are blocked, Bad Things will happen (potentially even a hang, because
firewalls can cause packets to be silently dropped).
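One rough way to check this by hand is sketched below in Python. It is only an illustration, not part of Open MPI: the helper names `open_listener` and `can_connect` are made up for this example, and the port the listener gets is whatever the OS hands out, not a port OMPI would necessarily pick. Run the listener on one node and try to connect from the other; a result of False means the connection was refused or timed out, and a timeout is what a firewall that silently drops packets would typically produce.

```python
import socket


def open_listener(port=0):
    """Listen on all interfaces; port=0 lets the OS choose a free port.

    Returns (server_socket, actual_port). Hypothetical helper for this test.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("", port))
    srv.listen(1)
    return srv, srv.getsockname()[1]


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds in time.

    Returns False both when the connection is refused and when it times
    out (the latter is what silently dropped packets look like).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example (run by hand, one part per node):
#   on node2:  srv, port = open_listener(); print(port)
#   on node1:  print(can_connect("node2", port))   # use the printed port
```

Repeating the test in the other direction (listener on node1, connect from node2) is worthwhile, since firewall rules are often not symmetric.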
On May 20, 2008, at 12:17 PM, Rudd, James wrote:
I have been trying to compile a molecular dynamics program with the
Open MPI 1.2.5 included in OFED 1.3. I am running Fedora Core 6; the
output of uname -r is 2.6.18-1.2798.fc6. I’ve traced the problems
I’ve been having back to Open MPI because I’m unable to run the test
programs such as glob on more than one node. I currently have 2
nodes connected to an InfiniBand switch with opensm running on
node1. The nodes can ping each other and I am able to ssh between
them without a password. My openmpi-default-hostfile includes the
following:
node1 slots=2 max-slots=4
node2 slots=4 max-slots=4
When I run “mpirun -np 4 --debug-daemons ./glob” I get:
Daemon [0,0,1] checking in as pid 21341 on host node1
And the program appears to hang. Once I press CTRL+C a couple of times,
I get the contents of error.txt.
Per the instructions in the FAQ, I’ve included the output of
“ibv_devinfo”, “ifconfig”, and “ulimit -l” in the
infiniband_info.txt file. The output of “ompi_info --all” is in the
ompi_info.txt file.
I’ve been tearing my hair out over this; any help would be greatly
appreciated.
James Rudd
JLC-Biomedical/Biotechnology Research Institute
North Carolina Central University
700 George Street
Durham, NC 27707
Phone: (919) 530-7015
Email: jr...@nccu.edu
http://ariel.acc.nccu.edu/Academics/BBRI/personnel/rudd.htm
<error.txt><infiniband_info.txt><ompi_info.txt>
--
Jeff Squyres
Cisco Systems