[OMPI users] Backwards compatibility?
Is OpenMPI backwards compatible? That is, if I am running 1.3.1 on one machine and 1.3.3 on the rest, is it supposed to work, or do they all need exactly the same version? When I add the machine with the mismatched version to the machine list and run a simple "hello world from each process" type program, I see no output whatsoever, even with the verbose flag - it just sits there indefinitely.

Thanks,
David
Re: [OMPI users] Backwards compatibility?
On Thu, Jul 23, 2009 at 5:47 PM, Ralph Castain wrote:
> I doubt those two would work together - however, a combination of 1.3.2 and
> 1.3.3 should.
>
> You might look at the ABI compatibility discussion threads (there have been
> several) on this list for the reasons. Basically, binary compatibility is
> supported starting with 1.3.2 and above.

Ok - I'll make sure to use all the same version. Is there any way that mismatch can be detected and an error thrown? It took me quite a while to figure out that one machine was running the wrong version.

Thanks,
David
[OMPI users] Test works with 3 computers, but not 4?
I wrote a simple program to display "hello world" from each process.

When I run this (126 - my machine, 122, and 123), everything works fine:

[doriad@daviddoria MPITest]$ mpirun -H 10.1.2.126,10.1.2.122,10.1.2.123 hello-mpi
From process 1 out of 3, Hello World!
From process 2 out of 3, Hello World!
From process 3 out of 3, Hello World!

When I run this (126 - my machine, 122, and 125), everything works fine:

[doriad@daviddoria MPITest]$ mpirun -H 10.1.2.126,10.1.2.122,10.1.2.125 hello-mpi
From process 2 out of 3, Hello World!
From process 1 out of 3, Hello World!
From process 3 out of 3, Hello World!

When I run this (126 - my machine, 123, and 125), everything works fine:

[doriad@daviddoria MPITest]$ mpirun -H 10.1.2.126,10.1.2.123,10.1.2.125 hello-mpi
From process 2 out of 3, Hello World!
From process 1 out of 3, Hello World!
From process 3 out of 3, Hello World!

However, when I run this (126 - my machine, 122, 123, AND 125), I get no output at all. Is there any way to check what is going on? Does anyone know why that would happen? I'm using OpenMPI 1.3.3.

Thanks,
David
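(For reference, the actual hello-mpi source is not shown in this thread; a minimal sketch of the kind of program being run here, with illustrative names, could look like this:

/* hello-mpi.c - minimal sketch of the test program (illustrative, not the
 * actual source from this thread). Each rank prints a greeting. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start the MPI runtime      */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank        */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes  */

    printf("From process %d out of %d, Hello World!\n", rank + 1, size);

    MPI_Finalize();
    return 0;
}

It would be compiled on each node with something like "mpicc hello-mpi.c -o hello-mpi".)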
Re: [OMPI users] Test works with 3 computers, but not 4?
On Wed, Jul 29, 2009 at 3:42 PM, Ralph Castain wrote:
> It sounds like perhaps IOF messages aren't getting relayed along the
> daemons. Note that the daemon on each node does have to be able to send TCP
> messages to all other nodes, not just mpirun.
>
> Couple of things you can do to check:
>
> 1. -mca routed direct - this will send all messages direct instead of
> across the daemons
>
> 2. --leave-session-attached - will allow you to see any errors reported by
> the daemons, including those from attempting to relay messages
>
> Ralph

Ralph, thanks for the quick response.

With -mca routed direct it works correctly.

With this:

mpirun -H 10.1.2.126,10.1.2.122,10.1.2.123,10.1.2.125 --leave-session-attached -np 4 /home/doriad/MPITest/hello-mpi

I still get no output and no errors from the daemons.

Is there a downside to using '-mca routed direct', or should I fix whatever is causing this daemon issue? Do you have any other tests for me to try to see what's wrong?

Thanks,
David
Re: [OMPI users] Test works with 3 computers, but not 4?
On Wed, Jul 29, 2009 at 4:15 PM, Ralph Castain wrote:
> Using direct can cause scaling issues as every process will open a socket
> to every other process in the job. You would at least have to ensure you
> have enough file descriptors available on every node.
>
> The most likely cause is either (a) a different OMPI version getting picked
> up on one of the nodes, or (b) something blocking communication between at
> least one of your other nodes. I would suspect the latter - perhaps a
> firewall or something?
>
> I'm disturbed by your not seeing any error output - that seems strange.
> Try adding --debug-daemons to the cmd line. That should definitely generate
> output from every daemon (at the least, they report they are alive).
>
> Ralph

Nifty, I used MPI_Get_processor_name - as you said, this is much more helpful output. I also checked all the versions and they seem to be fine - 'mpirun -V' says 1.3.3 on all 4 machines.

The output with '-mca routed direct' is now (correctly):

[doriad@daviddoria MPITest]$ mpirun -H 10.1.2.126,10.1.2.122,10.1.2.123,10.1.2.125 -mca routed direct hello-mpi
Process 0 on daviddoria out of 4
Process 1 on cloud3 out of 4
Process 2 on cloud4 out of 4
Process 3 on cloud6 out of 4

Here is the output with --debug-daemons. Is there a particular port / set of ports I can have my system admin unblock on the firewall to see if that fixes it?

[doriad@daviddoria MPITest]$ mpirun -H 10.1.2.126,10.1.2.122,10.1.2.123,10.1.2.125 --leave-session-attached --debug-daemons -np 4 hello-mpi
Daemon was launched on cloud3 - beginning to initialize
Daemon [[9461,0],1] checking in as pid 14707 on host cloud3
Daemon [[9461,0],1] not using static ports
[cloud3:14707] [[9461,0],1] orted: up and running - waiting for commands!
Daemon was launched on cloud4 - beginning to initialize
Daemon [[9461,0],2] checking in as pid 5987 on host cloud4
Daemon [[9461,0],2] not using static ports
[cloud4:05987] [[9461,0],2] orted: up and running - waiting for commands!
Daemon was launched on cloud6 - beginning to initialize
Daemon [[9461,0],3] checking in as pid 1037 on host cloud6
Daemon [[9461,0],3] not using static ports
[daviddoria:11061] [[9461,0],0] node[0].name daviddoria daemon 0 arch ffca0200
[daviddoria:11061] [[9461,0],0] node[1].name 10 daemon 1 arch ffca0200
[daviddoria:11061] [[9461,0],0] node[2].name 10 daemon 2 arch ffca0200
[daviddoria:11061] [[9461,0],0] node[3].name 10 daemon 3 arch ffca0200
[daviddoria:11061] [[9461,0],0] orted_cmd: received add_local_procs
[cloud6:01037] [[9461,0],3] orted: up and running - waiting for commands!
[cloud3:14707] [[9461,0],1] node[0].name daviddoria daemon 0 arch ffca0200
[cloud3:14707] [[9461,0],1] node[1].name 10 daemon 1 arch ffca0200
[cloud3:14707] [[9461,0],1] node[2].name 10 daemon 2 arch ffca0200
[cloud3:14707] [[9461,0],1] node[3].name 10 daemon 3 arch ffca0200
[cloud4:05987] [[9461,0],2] node[0].name daviddoria daemon 0 arch ffca0200
[cloud4:05987] [[9461,0],2] node[1].name 10 daemon 1 arch ffca0200
[cloud4:05987] [[9461,0],2] node[2].name 10 daemon 2 arch ffca0200
[cloud4:05987] [[9461,0],2] node[3].name 10 daemon 3 arch ffca0200
[cloud4:05987] [[9461,0],2] orted_cmd: received add_local_procs
[cloud3:14707] [[9461,0],1] orted_cmd: received add_local_procs
[daviddoria:11061] [[9461,0],0] orted_recv: received sync+nidmap from local proc [[9461,1],0]
[daviddoria:11061] [[9461,0],0] orted_cmd: received collective data cmd
[cloud4:05987] [[9461,0],2] orted_recv: received sync+nidmap from local proc [[9461,1],2]
[daviddoria:11061] [[9461,0],0] orted_cmd: received collective data cmd
[cloud4:05987] [[9461,0],2] orted_cmd: received collective data cmd

Any more thoughts?

Thanks,
David
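(For reference, per-rank output like "Process 0 on daviddoria out of 4" can be produced by a hello program along these lines - a sketch only, not the actual source used in this thread:

/* Sketch of a hello program that also reports the host via
 * MPI_Get_processor_name (illustrative; the real hello-mpi source
 * is not shown in this thread). */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, name_len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &name_len);  /* host this rank is running on */

    printf("Process %d on %s out of %d\n", rank, name, size);

    MPI_Finalize();
    return 0;
}

Seeing the host name in every line makes it much easier to tell which node a missing or misbehaving rank belongs to.)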
Re: [OMPI users] Test works with 3 computers, but not 4?
On Wed, Jul 29, 2009 at 4:57 PM, Ralph Castain wrote:
> Ah, so there is a firewall involved? That is always a problem. I gather
> that node 126 has clear access to all other nodes, but nodes 122, 123, and
> 125 do not all have access to each other?
>
> See if your admin is willing to open at least one port on each node that
> can reach all other nodes. It is easiest if it is the same port for every
> node, but not required. Then you can try setting the mca params
> oob_tcp_port_minv4 and oob_tcp_port_rangev4. This should allow the daemons
> to communicate.
>
> Check ompi_info --param oob tcp for info on those (and other) params.
>
> Ralph
>
> On Jul 29, 2009, at 2:46 PM, David Doria wrote:

Machine 125 had the default Fedora firewall turned on. I turned it off, and it now works with simply:

mpirun -H 10.1.2.126,10.1.2.122,10.1.2.123,10.1.2.125 hello-mpi

(The firewalls on the rest of the machines were already off in an attempt to avoid problems like this - I guess I just forgot one!)

Is there a "standard" port I can open on these local firewalls so I don't have to disable them completely and so I don't have to set the mca params oob_tcp_port_X?

Thanks,
David
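(For later readers: one way to apply the params Ralph mentions is to pin the daemons to a small fixed range on the mpirun command line, for example:

mpirun -H 10.1.2.126,10.1.2.122,10.1.2.123,10.1.2.125 \
       -mca oob_tcp_port_minv4 10000 -mca oob_tcp_port_rangev4 10 \
       hello-mpi

The port numbers here are purely illustrative, and the exact parameter names should be confirmed with 'ompi_info --param oob tcp' for your OpenMPI version. The chosen range would then need to be opened in each machine's firewall; note that the regular MPI point-to-point traffic may use additional ports beyond the daemon ones, so this alone may not cover everything.)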
[OMPI users] Two remote machines - asymmetric behavior
I have three machines: mine (daviddoria) and two identical remote machines (cloud3 and cloud6). I can password-less ssh between any pair. The machines are all 32-bit, running Fedora 11. OpenMPI was installed identically on each. The .bashrc is identical on each. /etc/hosts is identical on each.

I wrote a test "hello world" program to ensure OpenMPI is behaving correctly. The output is exactly as expected; each node seems to be alive.

[doriad@daviddoria MPITest]$ mpirun -H cloud6,daviddoria,cloud3 -np 3 hello-mpi
Process 1 on daviddoria out of 3
Process 2 on cloud3 out of 3
Process 0 on cloud6 out of 3

I am trying to get a parallel application called Paraview working with these three machines. Paraview is installed identically on each. As a test, I wanted to get it working with two at a time first.

With cloud3, everything goes smoothly. That is, I tell Paraview to start the server with

ssh cloud3 mpirun -H cloud3 pvserver

and to connect to the server on cloud3, and I get the following (expected) output:

Listen on port: 1
Waiting for client...
Client connected.

When I try the same thing on cloud6, it again goes smoothly (I tell Paraview to start the server with

ssh cloud6 mpirun -H cloud6 pvserver

and connect to the server on cloud6).

Now for the real test. I tell Paraview to start the server with

ssh cloud6 mpirun -H cloud6,cloud3 -np 2 pvserver

and connect to the server on cloud6. This again connects successfully. However, if I do the reverse:

ssh cloud3 mpirun -H cloud3,cloud6 -np 2 pvserver

and connect to the server on cloud3, it tries and tries for 60 seconds but it can't connect. I just see a bunch of "Failed to connect to server on cloud3" errors.

Does anyone have any idea what could cause this asymmetric behavior?

Thanks,
David
Re: [OMPI users] Two remote machines - asymmetric behavior
> I'm a newbie, so forgive me if I ask something stupid:
>
> Why are you running the ssh command before the mpirun command? I'm
> interested in setting up a paraview server on a LAN to post-process
> OpenFOAM simulation data.
>
> Just a total newbish comment: doesn't the mpirun in fact call for the
> ssh anyway? And if pvserver is to be run on multiple machines and is
> programmed in Open MPI, shouldn't
>
> mpirun -np procNumber -H host1,host2,host3 pvserver
>
> be enough to get it going, as well as any other parallel program? Again,
> please excuse my newbiness.
>
> Best regards,
>
> Tomislav

Tomislav,

As is probably apparent from my email(s), I am very new to all of this as well. From my understanding, to start the server on cloud3 from my machine (daviddoria), you must use the command

ssh cloud3 mpirun pvserver

If you use simply

mpirun pvserver

that will start the server on daviddoria. Can anyone confirm or deny?

Thanks,
David
Re: [OMPI users] Two remote machines - asymmetric behavior
On Mon, Aug 3, 2009 at 9:47 AM, Ralph Castain wrote:
> You are both correct. If you simply type "mpirun pvserver", then we will
> execute pvserver on whatever machine is local.
>
> However, if you type "mpirun -n 1 -H host1 pvserver", then we will start
> pvserver on the specified host. Note that mpirun will still be executing on
> your local machine - but pvserver will be running on the specified host.
>
> Ralph

Ralph,

Does anything change based on where mpirun is executing? Can you shed any light on the initial question of asymmetric behavior?

Thanks,
David
Re: [OMPI users] Two remote machines - asymmetric behavior
On Mon, Aug 3, 2009 at 1:41 PM, Ralph Castain wrote:
> The only thing that changes is the required connectivity. It sounds to me
> like you may have a firewall issue here, where cloud3 is blocking
> connectivity from cloud6, but cloud6 is allowing connectivity from cloud3.
>
> Is there a firewall in operation, per chance?
>
> Ralph

I have turned off all firewalls. This is the wireshark log of the traffic on one of the machines in the case that does not work:

http://rpi.edu/~doriad/cloud3_cloud3+6.wsk

I see a bunch of red lines, whereas in any of the working test cases I do not see any red lines. As you can tell from my analysis, I am an expert wireshark user (haha!) - can anyone interpret the problem from this file? Maybe I need to ask the wireshark mailing list for an analysis?

Thanks,
David