Using -mca routed direct can cause scaling issues, as every process will
open a socket to every other process in the job. At a minimum, you would
have to ensure enough file descriptors are available on every node.
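
As a rough sketch of that descriptor check (the function names and the
headroom of 64 are assumptions for illustration, not anything from Open
MPI), one could compare the number of peers against the limit reported
by ulimit -n:

```shell
# Peers each process connects to under direct routing: everyone but itself.
peers_per_process() {
    echo $(( $1 - 1 ))
}

# Succeeds if the descriptor limit covers the peers plus some headroom.
# Args: total processes in the job, descriptor limit, headroom.
# The default headroom of 64 (stdio, libraries, open files) is an assumption.
fd_limit_ok() {
    [ "$2" = "unlimited" ] && return 0
    [ $(( $(peers_per_process "$1") + ${3:-64} )) -le "$2" ]
}
```

Running something like fd_limit_ok 4 "$(ulimit -n)" || echo "raise the limit"
on each node gives a first approximation; the real descriptor usage
depends on what else each process has open.
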
The most likely cause is either (a) a different OMPI version getting
picked up on one of the nodes, or (b) something blocking communication
between at least two of your other nodes. I would suspect the latter -
perhaps a firewall or something?
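
To rule out (a), a minimal sketch along these lines could compare what
each node picks up. It assumes passwordless ssh to every host and that
mpirun is on the non-interactive PATH there; the helper names are
hypothetical, and the host list is the one from the command quoted
below.

```shell
# Hosts copied from the mpirun command quoted later in this thread.
HOSTS="10.1.2.126 10.1.2.122 10.1.2.123 10.1.2.125"

# Print the first line of "mpirun --version" as seen by one host.
# Assumes passwordless ssh; -n keeps ssh from reading our stdin.
remote_ompi_version() {
    ssh -n "$1" 'mpirun --version 2>&1 | head -n 1'
}

# Succeeds only if every line read from stdin is identical.
all_same() {
    distinct=$(sort -u | wc -l)
    [ $((distinct)) -le 1 ]
}
```

For example:
for h in $HOSTS; do remote_ompi_version "$h"; done | all_same || echo "versions differ"
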
I'm disturbed by your not seeing any error output - that seems
strange. Try adding --debug-daemons to the cmd line. That should
definitely generate output from every daemon (at the least, they
report they are alive).
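
For instance, the command from the quoted message below with the flag
added might look like the following sketch. The hosts and binary path
are copied from that message; the debug_cmd helper is hypothetical and
only prints the command so it can be reviewed before running.

```shell
# Hosts and binary path are copied from the command quoted later in this thread.
HOSTS="10.1.2.126,10.1.2.122,10.1.2.123,10.1.2.125"
APP=/home/doriad/MPITest/hello-mpi

# Print the mpirun command line with --debug-daemons added.
# Nothing is executed here; run the printed line by hand or pipe it to sh.
debug_cmd() {
    echo "mpirun -H $HOSTS --debug-daemons --leave-session-attached -np 4 $APP"
}
```

Calling debug_cmd prints the full command line.
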
Ralph
On Jul 29, 2009, at 2:06 PM, David Doria wrote:
On Wed, Jul 29, 2009 at 3:42 PM, Ralph Castain <r...@open-mpi.org>
wrote:
It sounds like perhaps IOF messages aren't getting relayed along the
daemons. Note that the daemon on each node does have to be able to
send TCP messages to all other nodes, not just mpirun.
Couple of things you can do to check:
1. -mca routed direct - this will send all messages direct instead
of across the daemons
2. --leave-session-attached - will allow you to see any errors
reported by the daemons, including those from attempting to relay
messages
Ralph
Ralph, thanks for the quick response.
With
-mca routed direct
it works correctly.
With this:
mpirun -H 10.1.2.126,10.1.2.122,10.1.2.123,10.1.2.125 --leave-session-attached -np 4 /home/doriad/MPITest/hello-mpi
I still get no output nor errors from the daemons.
Is there a downside to using '-mca routed direct'? Or should I fix
whatever is causing this daemon issue? Do you have any other tests for
me to try to see what's wrong?
Thanks,
David