On Wed, Jul 29, 2009 at 4:15 PM, Ralph Castain <r...@open-mpi.org>
wrote:
Using the direct routed component can cause scaling issues, as every process will open a
socket to every other process in the job. You would at least have to
ensure you have enough file descriptors available on every node.
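A rough way to sanity-check the file descriptor point on a node is to compare the per-process soft limit against the worst case described above (one socket to every other process in the job). This is only a sketch: the function name and the `reserve` allowance for stdio and other open files are assumptions made for illustration, not Open MPI settings.

```python
import resource


def direct_routing_fd_check(procs_in_job, reserve=64):
    """Return (needed, soft_limit, ok) for a fully connected job.

    Assumes the worst case: one socket to every other process in the
    job. `reserve` is a hypothetical allowance for stdio, shared
    libraries, and any other files the process keeps open.
    """
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    needed = (procs_in_job - 1) + reserve
    # RLIM_INFINITY means the limit is effectively unbounded.
    ok = soft_limit == resource.RLIM_INFINITY or needed <= soft_limit
    return needed, soft_limit, ok


if __name__ == "__main__":
    needed, limit, ok = direct_routing_fd_check(procs_in_job=4)
    status = "ok" if ok else "too low"
    print(f"need about {needed} descriptors, soft limit is {limit}: {status}")
```

The limit is per process and per node, so a check like this would have to be run on every machine in the host list.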
The most likely cause is either (a) a different OMPI version getting
picked up on one of the nodes, or (b) something blocking
communication with at least one of your other nodes. I would
suspect the latter - perhaps a firewall or something?
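One way to test the firewall hypothesis independently of Open MPI is a plain TCP reachability probe between nodes. The sketch below is an assumption about how one might do that, not something from this thread: `can_connect` is a hypothetical helper, and the demo only exercises the loopback interface so it runs anywhere.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Loopback demo: open a listener on an OS-chosen port, probe it,
    # then close it and probe again.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]
    print("listener open:  ", can_connect("127.0.0.1", port))
    listener.close()
    print("listener closed:", can_connect("127.0.0.1", port))
```

To use it across machines, one would start a listener on one node on some test port and call `can_connect` from each of the others; a probe that fails in one direction only would point at filtering between those two hosts.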
I'm disturbed by your not seeing any error output - that seems
strange. Try adding --debug-daemons to the command line. That should
definitely generate output from every daemon (at the least, they
report they are alive).
Ralph
Nifty, I used MPI_Get_processor_name - as you said, this is much
more helpful output. I also checked all the versions and they seem to
be fine - 'mpirun -V' says 1.3.3 on all 4 machines.
The output with '-mca routed direct' is now (correctly):
[doriad@daviddoria MPITest]$ mpirun -H 10.1.2.126,10.1.2.122,10.1.2.123,10.1.2.125 -mca routed direct hello-mpi
Process 0 on daviddoria out of 4
Process 1 on cloud3 out of 4
Process 2 on cloud4 out of 4
Process 3 on cloud6 out of 4
Here is the output with --debug-daemons.
Is there a particular port / set of ports I can have my system admin
unblock on the firewall to see if that fixes it?
[doriad@daviddoria MPITest]$ mpirun -H 10.1.2.126,10.1.2.122,10.1.2.123,10.1.2.125 --leave-session-attached --debug-daemons -np 4 hello-mpi
Daemon was launched on cloud3 - beginning to initialize
Daemon [[9461,0],1] checking in as pid 14707 on host cloud3
Daemon [[9461,0],1] not using static ports
[cloud3:14707] [[9461,0],1] orted: up and running - waiting for commands!
Daemon was launched on cloud4 - beginning to initialize
Daemon [[9461,0],2] checking in as pid 5987 on host cloud4
Daemon [[9461,0],2] not using static ports
[cloud4:05987] [[9461,0],2] orted: up and running - waiting for commands!
Daemon was launched on cloud6 - beginning to initialize
Daemon [[9461,0],3] checking in as pid 1037 on host cloud6
Daemon [[9461,0],3] not using static ports
[daviddoria:11061] [[9461,0],0] node[0].name daviddoria daemon 0 arch ffca0200
[daviddoria:11061] [[9461,0],0] node[1].name 10 daemon 1 arch ffca0200
[daviddoria:11061] [[9461,0],0] node[2].name 10 daemon 2 arch ffca0200
[daviddoria:11061] [[9461,0],0] node[3].name 10 daemon 3 arch ffca0200
[daviddoria:11061] [[9461,0],0] orted_cmd: received add_local_procs
[cloud6:01037] [[9461,0],3] orted: up and running - waiting for commands!
[cloud3:14707] [[9461,0],1] node[0].name daviddoria daemon 0 arch ffca0200
[cloud3:14707] [[9461,0],1] node[1].name 10 daemon 1 arch ffca0200
[cloud3:14707] [[9461,0],1] node[2].name 10 daemon 2 arch ffca0200
[cloud3:14707] [[9461,0],1] node[3].name 10 daemon 3 arch ffca0200
[cloud4:05987] [[9461,0],2] node[0].name daviddoria daemon 0 arch ffca0200
[cloud4:05987] [[9461,0],2] node[1].name 10 daemon 1 arch ffca0200
[cloud4:05987] [[9461,0],2] node[2].name 10 daemon 2 arch ffca0200
[cloud4:05987] [[9461,0],2] node[3].name 10 daemon 3 arch ffca0200
[cloud4:05987] [[9461,0],2] orted_cmd: received add_local_procs
[cloud3:14707] [[9461,0],1] orted_cmd: received add_local_procs
[daviddoria:11061] [[9461,0],0] orted_recv: received sync+nidmap from local proc [[9461,1],0]
[daviddoria:11061] [[9461,0],0] orted_cmd: received collective data cmd
[cloud4:05987] [[9461,0],2] orted_recv: received sync+nidmap from local proc [[9461,1],2]
[daviddoria:11061] [[9461,0],0] orted_cmd: received collective data cmd
[cloud4:05987] [[9461,0],2] orted_cmd: received collective data cmd
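With interleaved output like this it can help to tabulate which daemons reached which step. The following is a minimal sketch of that idea, not an Open MPI tool: the function name and regular expression are my own, and `SAMPLE` holds a few lines copied from the output above with the mail line-wrapping undone.

```python
import re

# Matches a daemon name of the form "[[jobid,0],index]" and captures
# the daemon index plus the rest of the line.
DAEMON_LINE = re.compile(r"\[\[\d+,0\],(\d+)\]\s+(.*)")


def daemons_reporting(log_text, phrase):
    """Return the sorted daemon indices whose log lines contain `phrase`."""
    seen = set()
    for line in log_text.splitlines():
        match = DAEMON_LINE.search(line)
        if match and phrase in match.group(2):
            seen.add(int(match.group(1)))
    return sorted(seen)


SAMPLE = """\
[cloud3:14707] [[9461,0],1] orted: up and running - waiting for commands!
[cloud4:05987] [[9461,0],2] orted: up and running - waiting for commands!
[daviddoria:11061] [[9461,0],0] orted_cmd: received add_local_procs
[cloud6:01037] [[9461,0],3] orted: up and running - waiting for commands!
[cloud4:05987] [[9461,0],2] orted_cmd: received add_local_procs
[cloud3:14707] [[9461,0],1] orted_cmd: received add_local_procs
"""

if __name__ == "__main__":
    all_daemons = {0, 1, 2, 3}
    got = daemons_reporting(SAMPLE, "received add_local_procs")
    print("received add_local_procs:", got)
    print("not seen:", sorted(all_daemons - set(got)))
```

On these sample lines, daemons 0, 1 and 2 log `received add_local_procs` while daemon 3 (cloud6) does not. That is only an observation about the pasted output, not a confirmed diagnosis, but it may suggest which node to look at first.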
Any more thoughts?
Thanks,
David