On Wed, Jul 29, 2009 at 4:15 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Using direct can cause scaling issues as every process will open a socket
> to every other process in the job. You would at least have to ensure you
> have enough file descriptors available on every node.
> The most likely cause is either (a) a different OMPI version getting picked
> up on one of the nodes, or (b) something blocking communication between at
> least one of your other nodes. I would suspect the latter - perhaps a
> firewall or something?
>
> I'm disturbed by your not seeing any error output - that seems strange.
> Try adding --debug-daemons to the cmd line. That should definitely generate
> output from every daemon (at the least, they report they are alive).
>
> Ralph
>

Nifty, I used MPI_Get_processor_name - as you said, this is much more
helpful output. I also checked all the versions and they seem to be fine -
'mpirun -V' says 1.3.3 on all 4 machines.

The output with '-mca routed direct' is now (correctly):
[doriad@daviddoria MPITest]$ mpirun -H 10.1.2.126,10.1.2.122,10.1.2.123,10.1.2.125 -mca routed direct hello-mpi
Process 0 on daviddoria out of 4
Process 1 on cloud3 out of 4
Process 2 on cloud4 out of 4
Process 3 on cloud6 out of 4

Here is the output with --debug-daemons.

Is there a particular port / set of ports I can have my system admin unblock
on the firewall to see if that fixes it?
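
In the meantime, here is a minimal sketch of how I could test whether a given TCP port on another node is reachable at all. It is only an illustration: the function name can_connect is my own, and since the daemons report "not using static ports" I do not know which port numbers actually matter, so any port passed in is a placeholder.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout.

    A False result only says the connection attempt failed; it cannot tell a
    firewall drop apart from nothing listening on that port.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, can_connect("10.1.2.122", 5000) would try one of the hosts from my mpirun command, where 5000 is a made-up port and something would have to be listening on it for the check to succeed.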

[doriad@daviddoria MPITest]$ mpirun -H 10.1.2.126,10.1.2.122,10.1.2.123,10.1.2.125 --leave-session-attached --debug-daemons -np 4 hello-mpi


Daemon was launched on cloud3 - beginning to initialize
Daemon [[9461,0],1] checking in as pid 14707 on host cloud3
Daemon [[9461,0],1] not using static ports
[cloud3:14707] [[9461,0],1] orted: up and running - waiting for commands!
Daemon was launched on cloud4 - beginning to initialize
Daemon [[9461,0],2] checking in as pid 5987 on host cloud4
Daemon [[9461,0],2] not using static ports
[cloud4:05987] [[9461,0],2] orted: up and running - waiting for commands!
Daemon was launched on cloud6 - beginning to initialize
Daemon [[9461,0],3] checking in as pid 1037 on host cloud6
Daemon [[9461,0],3] not using static ports
[daviddoria:11061] [[9461,0],0] node[0].name daviddoria daemon 0 arch ffca0200
[daviddoria:11061] [[9461,0],0] node[1].name 10 daemon 1 arch ffca0200
[daviddoria:11061] [[9461,0],0] node[2].name 10 daemon 2 arch ffca0200
[daviddoria:11061] [[9461,0],0] node[3].name 10 daemon 3 arch ffca0200
[daviddoria:11061] [[9461,0],0] orted_cmd: received add_local_procs
[cloud6:01037] [[9461,0],3] orted: up and running - waiting for commands!
[cloud3:14707] [[9461,0],1] node[0].name daviddoria daemon 0 arch ffca0200
[cloud3:14707] [[9461,0],1] node[1].name 10 daemon 1 arch ffca0200
[cloud3:14707] [[9461,0],1] node[2].name 10 daemon 2 arch ffca0200
[cloud3:14707] [[9461,0],1] node[3].name 10 daemon 3 arch ffca0200
[cloud4:05987] [[9461,0],2] node[0].name daviddoria daemon 0 arch ffca0200
[cloud4:05987] [[9461,0],2] node[1].name 10 daemon 1 arch ffca0200
[cloud4:05987] [[9461,0],2] node[2].name 10 daemon 2 arch ffca0200
[cloud4:05987] [[9461,0],2] node[3].name 10 daemon 3 arch ffca0200
[cloud4:05987] [[9461,0],2] orted_cmd: received add_local_procs
[cloud3:14707] [[9461,0],1] orted_cmd: received add_local_procs
[daviddoria:11061] [[9461,0],0] orted_recv: received sync+nidmap from local proc [[9461,1],0]
[daviddoria:11061] [[9461,0],0] orted_cmd: received collective data cmd
[cloud4:05987] [[9461,0],2] orted_recv: received sync+nidmap from local proc [[9461,1],2]
[daviddoria:11061] [[9461,0],0] orted_cmd: received collective data cmd
[cloud4:05987] [[9461,0],2] orted_cmd: received collective data cmd
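
To make it easier to see how far each daemon got, here is a rough sketch that groups the "orted" messages by daemon number. It assumes the daemon names always look like [[jobid,0],N] as they do in the output above; the function names daemon_events and daemons_missing are just mine for illustration.

```python
import re
from collections import defaultdict

# Matches a daemon name such as "[[9461,0],2]" followed by an orted message,
# e.g. "orted_cmd: received add_local_procs". Group 1 is the daemon number,
# group 2 is the message text after the "orted...:" tag.
DAEMON_LINE = re.compile(r"\[\[\d+,0\],(\d+)\]\s+orted\w*:\s+(.+)")


def daemon_events(log_text):
    """Map each daemon number to the list of orted messages it logged."""
    events = defaultdict(list)
    for line in log_text.splitlines():
        match = DAEMON_LINE.search(line)
        if match:
            events[int(match.group(1))].append(match.group(2).strip())
    return dict(events)


def daemons_missing(events, n_daemons, needle):
    """Return daemon numbers 0..n_daemons-1 with no message containing needle."""
    return [
        d for d in range(n_daemons)
        if not any(needle in msg for msg in events.get(d, []))
    ]
```

Run over the output above with n_daemons=4, daemons_missing(events, 4, "add_local_procs") should flag daemon 3 (cloud6), which reports "up and running" but never logs add_local_procs.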

Any more thoughts?

Thanks,

David
