On Jan 30, 2007, at 9:35 AM, Fisher, Mark S wrote:
The master process uses both MPI_ANY_SOURCE and MPI_ANY_TAG while
waiting for requests from slave processes. The slaves sometimes use
MPI_ANY_TAG but the source is always specified.
I think you said that you only had corruption issues on the slave,
right? If so, the ANY_SOURCE/ANY_TAG on the master probably aren't
the issue.
But if you're doing ANY_TAG on the slaves, you might want to double
check that that code is doing exactly what you think it's doing. Are
there any race conditions such that a message could be received on
that ANY_TAG that you did not intend to receive there? Look
especially hard at non-blocking receives with ANY_TAG.
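For example, here is the kind of (completely made-up) pattern I'm talking about -- none of these names, tags, or buffer sizes come from WINDUS, it's just a sketch of the race:

  ! Sketch of a wildcard race on the slave side (illustrative only).
  subroutine slave_recv_example(master)
    implicit none
    include 'mpif.h'
    integer, intent(in) :: master
    integer :: ierr, req_ctl, req_task
    integer :: status(MPI_STATUS_SIZE)
    integer :: ctlbuf(4)
    double precision :: taskbuf(1000)

    ! A wildcard receive posted "for control messages"...
    call MPI_IRECV(ctlbuf, 4, MPI_INTEGER, master, MPI_ANY_TAG, &
                   MPI_COMM_WORLD, req_ctl, ierr)
    ! ...and a specific-tag receive for task data from the same master.
    call MPI_IRECV(taskbuf, 1000, MPI_DOUBLE_PRECISION, master, 100, &
                   MPI_COMM_WORLD, req_task, ierr)

    ! Because the wildcard receive was posted first and matches *any* tag
    ! from the master, it can consume the tag-100 task message (with a
    ! type/size mismatch) before the second receive ever sees it.
    call MPI_WAIT(req_ctl, status, ierr)
    call MPI_WAIT(req_task, status, ierr)
  end subroutine slave_recv_example

If anything like that can happen in your slave -- a wildcard receive outstanding at the same time as a more specific one from the same peer -- that's where I'd look first.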
We have run the code through valgrind for a number of cases, including the one being used here.
Excellent.
The code is Fortran 90 and we are using the FORTRAN 77 interface, so I do not believe this is a problem.
Agreed; should not be an issue.
We are using Gigabit Ethernet.
Ok, good.
I could look at LAM again to see if it would work. The code needs to be in a specific working directory and we need some environment variables set. This was not supported well in pre-MPI-2 versions of MPI. For MPICH1 I actually launch a script for the slaves so that we have the proper setup before running the executable. Note I had tried that with OpenMPI and it had an internal error in orterun.
Really? OMPI's mpirun does not depend on the executable being an MPI
application -- indeed, you can "mpirun -np 2 uptime" with no
problem. What problem did you run into here?
This is not a problem since mpirun can set up everything we need. If you think it is worthwhile I will download and try it.
From what you describe, it sounds like order of messaging may be the
issue, not necessarily MPI handle types. So let's hold off on that
one for the moment (although LAM should be pretty straightforward to
try -- you should be able to mpirun scripts with no problems; perhaps
you can try it as a background effort when you have spare cycles /
etc.), and look at your slave code for receiving.
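One cheap way to look: have the slave log the actual source and tag of everything it receives on a wildcard and compare against what it expected. A minimal sketch (the routine name, tag argument, and datatype are invented, not taken from your code):

  subroutine checked_recv(buf, n, master, expected_tag)
    implicit none
    include 'mpif.h'
    integer, intent(in) :: n, master, expected_tag
    double precision, intent(out) :: buf(n)
    integer :: ierr, status(MPI_STATUS_SIZE)

    call MPI_RECV(buf, n, MPI_DOUBLE_PRECISION, master, MPI_ANY_TAG, &
                  MPI_COMM_WORLD, status, ierr)

    ! A mismatch here points directly at the receive that is picking
    ! up a message it was not meant to get.
    if (status(MPI_TAG) /= expected_tag) then
       write(*,*) 'unexpected message: source=', status(MPI_SOURCE), &
                  ' tag=', status(MPI_TAG), ' wanted tag=', expected_tag
       call MPI_ABORT(MPI_COMM_WORLD, 1, ierr)
    end if
  end subroutine checked_recv

Even a handful of WRITE statements like that usually shows quickly whether the "scrambling" is really a mismatched receive.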
-----Original Message-----
From: Jeff Squyres [mailto:jsquy...@cisco.com]
Sent: Monday, January 29, 2007 7:54 PM
To: Open MPI Users
Subject: Re: [OMPI users] Scrambled communications using ssh starter
on multiple nodes.
Without analyzing your source, it's hard to say. I will say that OMPI may send fragments out of order, but we do, of course, provide the same message ordering guarantees that MPI mandates (there is a small sketch of this after the questions below). So let me ask a few leading questions:
- Are you using any wildcards in your receives, such as MPI_ANY_SOURCE
or MPI_ANY_TAG?
- Have you run your code through a memory-checking debugger such as
valgrind?
- I don't know what Scali MPI uses, but MPICH and Intel MPI use
integers
for MPI handles. Have you tried LAM/MPI as well? It, like Open MPI,
uses pointers for MPI handles. I mention this because apps that
unintentionally have code that takes advantage of integer handles can
sometimes behave unpredictably when switching to a pointer-based MPI
implementation.
- What network interconnect are you using between the two hosts?
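To make the ordering point concrete, here's a tiny sketch (ranks, tags, and counts are invented): messages between one sender and one receiver on a given communicator are non-overtaking, but nothing orders messages from *different* senders, so a receive posted with MPI_ANY_SOURCE can see them interleaved in any order and has to consult the status to tell them apart.

  program ordering_sketch
    implicit none
    include 'mpif.h'
    integer :: ierr, rank, i, val
    integer :: status(MPI_STATUS_SIZE)

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    if (rank /= 0) then
       ! Each slave sends two messages; rank 0 is guaranteed to see a
       ! given slave's message 1 before its message 2 (same sender/comm).
       do i = 1, 2
          val = rank*10 + i
          call MPI_SEND(val, 1, MPI_INTEGER, 0, i, MPI_COMM_WORLD, ierr)
       end do
    else
       ! With ANY_SOURCE/ANY_TAG, messages from different slaves may
       ! arrive interleaved in any order; only the status says who/what.
       do i = 1, 4   ! assumes exactly 2 slaves for this sketch
          call MPI_RECV(val, 1, MPI_INTEGER, MPI_ANY_SOURCE, MPI_ANY_TAG, &
                        MPI_COMM_WORLD, status, ierr)
          write(*,*) 'got', val, 'from', status(MPI_SOURCE), &
                     'tag', status(MPI_TAG)
       end do
    end if
    call MPI_FINALIZE(ierr)
  end program ordering_sketch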
On Jan 25, 2007, at 4:22 PM, Fisher, Mark S wrote:
Recently I wanted to try OpenMPI for use with our CFD flow solver
WINDUS. The code uses a master/slave methodology where the master
handles I/O and issues tasks for the slaves to perform. The original
parallel implementation was done in 1993 using PVM and in 1999 we
added support for MPI.
When testing the code with OpenMPI 1.1.2 it ran fine on a single machine. As soon as I ran on more than one machine I started getting random errors right away (the attached tarball has a good and a bad output). It looked like the messages were either out of order or were intended for the other slave process. In the run mode used there is no slave-to-slave communication. In the bad output the code died near the beginning of the communication between master and slave; sometimes it will run further before it fails.
I have included a tar file with the build and configuration info. The two nodes are identical Xeon 2.8 GHz machines running SLED 10. I am running real-time (no queue) using the ssh starter with the following appfile:
-x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host skipper2 -wdir /opt/scratch/m209290/ol.scr.16348 -np 1 ./__bcfdbeta.exe
-x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host copland -wdir /tmp/mpi.m209290 -np 2 ./__bcfdbeta.exe
The above file fails but the following works:
-x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host skipper2 -wdir /opt/scratch/m209290/ol.scr.16348 -np 1 ./__bcfdbeta.exe
-x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host skipper2 -wdir /tmp/mpi.m209290 -np 2 ./__bcfdbeta.exe
The first process is the master and the other two are the slaves. I am not sure what is going wrong; the code runs fine with many other MPI distributions (MPICH1/2, Intel, Scali...). I assume that either I built it wrong or am not running it properly, but I cannot see what I am doing wrong. Any help would be appreciated!
<<mpipb.tgz>>
--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems