James --

Sorry for the delay in replying.

Do you have any firewall software running on your nodes (e.g., iptables)? OMPI uses random TCP ports to connect between nodes for control messages. If they can't reach each other because TCP ports are blocked, Bad Things will happen (potentially even a hang, because firewalls can cause packets to be silently dropped).
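One quick way to check this is to open a listener on an OS-chosen port on one node and probe it from the other. The following is only a minimal sketch, assuming Python 3 is available on both nodes. The helper names `open_listener` and `probe` are made up for illustration and are not part of Open MPI, and this is not how OMPI itself sets up its connections.

```python
import socket


def open_listener(bind_addr="0.0.0.0"):
    """Listen on an OS-chosen (ephemeral) TCP port; return (socket, port)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((bind_addr, 0))  # port 0 asks the OS to pick a free port
    srv.listen(1)
    return srv, srv.getsockname()[1]


def probe(host, port, timeout=3.0):
    """Try a TCP connect and classify the outcome.

    "open"    - the connection succeeded
    "timeout" - no answer at all (consistent with packets being dropped)
    "refused" - the host answered, but nothing is listening on that port
    "error"   - any other socket-level failure (e.g. name lookup)
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        return "timeout"
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        return "error"


# Hypothetical usage (host name taken from the hostfile quoted below):
#   on node1:  srv, port = open_listener(); print(port)   # keep it running
#   on node2:  print(probe("node1", port))
```

If the probe reports "timeout" while ping and ssh both work, a firewall dropping traffic on non-standard ports is a likely suspect. It is worth running the check in both directions, since the nodes need to reach each other.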


On May 20, 2008, at 12:17 PM, Rudd, James wrote:

I have been trying to compile a molecular dynamics program with the Open MPI 1.2.5 included in OFED 1.3. I am running Fedora Core 6; the output of uname -r is 2.6.18-1.2798.fc6. I’ve traced the problems I’ve been having back to Open MPI because I’m unable to run the test programs such as glob on more than one node. I currently have 2 nodes connected to an InfiniBand switch, with opensm running on node1. The nodes can ping each other and I am able to ssh between them without a password. My openmpi-default-hostfile includes the following:

node1 slots=2 max-slots=4
node2 slots=4 max-slots=4

When I run “mpirun -np 4 --debug-daemons ./glob” I get:
Daemon [0,0,1] checking in as pid 21341 on host node1
The program then appears to hang. Once I press CTRL+C a couple of times, I get the contents of error.txt.

Per the instructions in the FAQ, I’ve included the output of “ibv_devinfo”, “ifconfig”, and “ulimit -l” in the infiniband_info.txt file. The results of “ompi_info --all” are in the ompi_info.txt file.

I’ve been tearing my hair out over this, so any help would be greatly appreciated.

James Rudd
JLC-Biomedical/Biotechnology Research Institute
North Carolina Central University
700 George Street
Durham, NC 27707
Phone:  (919) 530-7015
Email:  jr...@nccu.edu
http://ariel.acc.nccu.edu/Academics/BBRI/personnel/rudd.htm

<error.txt><infiniband_info.txt><ompi_info.txt>
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Jeff Squyres
Cisco Systems

