On Wed, Jul 29, 2009 at 4:15 PM, Ralph Castain <r...@open-mpi.org> wrote:
> Using direct can cause scaling issues as every process will open a socket
> to every other process in the job. You would at least have to ensure you
> have enough file descriptors available on every node.
> The most likely cause is either (a) a different OMPI version getting picked
> up on one of the nodes, or (b) something blocking communication between at
> least one of your other nodes. I would suspect the latter - perhaps a
> firewall or something?
>
> I'm disturbed by your not seeing any error output - that seems strange.
> Try adding --debug-daemons to the cmd line. That should definitely generate
> output from every daemon (at the least, they report they are alive).
>
> Ralph
>

Nifty, I used MPI_Get_processor_name - as you said, this is much more helpful output. I also checked all the versions and they seem to be fine - 'mpirun -V' says 1.3.3 on all 4 machines.

The output with '-mca routed direct' is now (correctly):

[doriad@daviddoria MPITest]$ mpirun -H 10.1.2.126,10.1.2.122,10.1.2.123,10.1.2.125 -mca routed direct hello-mpi
Process 0 on daviddoria out of 4
Process 1 on cloud3 out of 4
Process 2 on cloud4 out of 4
Process 3 on cloud6 out of 4

Here is the output with --debug-daemons. Is there a particular port / set of ports I can have my system admin unblock on the firewall to see if that fixes it?

[doriad@daviddoria MPITest]$ mpirun -H 10.1.2.126,10.1.2.122,10.1.2.123,10.1.2.125 --leave-session-attached --debug-daemons -np 4 hello-mpi
Daemon was launched on cloud3 - beginning to initialize
Daemon [[9461,0],1] checking in as pid 14707 on host cloud3
Daemon [[9461,0],1] not using static ports
[cloud3:14707] [[9461,0],1] orted: up and running - waiting for commands!
Daemon was launched on cloud4 - beginning to initialize
Daemon [[9461,0],2] checking in as pid 5987 on host cloud4
Daemon [[9461,0],2] not using static ports
[cloud4:05987] [[9461,0],2] orted: up and running - waiting for commands!
Daemon was launched on cloud6 - beginning to initialize
Daemon [[9461,0],3] checking in as pid 1037 on host cloud6
Daemon [[9461,0],3] not using static ports
[daviddoria:11061] [[9461,0],0] node[0].name daviddoria daemon 0 arch ffca0200
[daviddoria:11061] [[9461,0],0] node[1].name 10 daemon 1 arch ffca0200
[daviddoria:11061] [[9461,0],0] node[2].name 10 daemon 2 arch ffca0200
[daviddoria:11061] [[9461,0],0] node[3].name 10 daemon 3 arch ffca0200
[daviddoria:11061] [[9461,0],0] orted_cmd: received add_local_procs
[cloud6:01037] [[9461,0],3] orted: up and running - waiting for commands!
[cloud3:14707] [[9461,0],1] node[0].name daviddoria daemon 0 arch ffca0200
[cloud3:14707] [[9461,0],1] node[1].name 10 daemon 1 arch ffca0200
[cloud3:14707] [[9461,0],1] node[2].name 10 daemon 2 arch ffca0200
[cloud3:14707] [[9461,0],1] node[3].name 10 daemon 3 arch ffca0200
[cloud4:05987] [[9461,0],2] node[0].name daviddoria daemon 0 arch ffca0200
[cloud4:05987] [[9461,0],2] node[1].name 10 daemon 1 arch ffca0200
[cloud4:05987] [[9461,0],2] node[2].name 10 daemon 2 arch ffca0200
[cloud4:05987] [[9461,0],2] node[3].name 10 daemon 3 arch ffca0200
[cloud4:05987] [[9461,0],2] orted_cmd: received add_local_procs
[cloud3:14707] [[9461,0],1] orted_cmd: received add_local_procs
[daviddoria:11061] [[9461,0],0] orted_recv: received sync+nidmap from local proc [[9461,1],0]
[daviddoria:11061] [[9461,0],0] orted_cmd: received collective data cmd
[cloud4:05987] [[9461,0],2] orted_recv: received sync+nidmap from local proc [[9461,1],2]
[daviddoria:11061] [[9461,0],0] orted_cmd: received collective data cmd
[cloud4:05987] [[9461,0],2] orted_cmd: received collective data cmd

Any more thoughts?

Thanks,
David
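
One way to test the firewall theory independently of Open MPI is a plain TCP probe between the nodes. The following is only a rough sketch, not anything Open MPI provides: the function names `probe` and `probe_all` are made up for illustration, and the port is a placeholder. Since the daemons above report "not using static ports", a fixed-port probe like this can only show whether ordinary TCP traffic gets through between two hosts, not which ports the daemons actually chose.

```python
import socket


def probe(host, port, timeout=3.0):
    """Try one TCP connection and classify the outcome.

    Returns "open", "refused", "timeout", or "error".
    """
    try:
        # create_connection resolves the host and applies the timeout
        # to the connect attempt.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The target answered, but nothing is listening on this port.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timeout"
    except OSError:
        # Anything else, e.g. no route to host or name lookup failure.
        return "error"


def probe_all(hosts, port, timeout=3.0):
    """Probe the same (hypothetical) port on each host; return {host: result}."""
    return {host: probe(host, port, timeout) for host in hosts}
```

For example, run from daviddoria, `probe_all(["10.1.2.122", "10.1.2.123", "10.1.2.125"], 5000)` would try port 5000 (an arbitrary choice) on the other three machines. An "open" result needs some listener running on that port on the target. "refused" means the packets arrived but nothing was listening, while "timeout" commonly suggests the traffic is being dropped somewhere in between, which would be consistent with a firewall.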