The master process uses both MPI_ANY_SOURCE and MPI_ANY_TAG while
waiting for requests from slave processes. The slaves sometimes use
MPI_ANY_TAG but the source is always specified.
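In case it is useful, the master-side pattern is roughly the following
(a simplified sketch in the Fortran 77 interface we use, not the actual
WINDUS source; the buffer size, type and variable names are just
placeholders):

      PROGRAM MASTER_SKETCH
      INCLUDE 'mpif.h'
      INTEGER status(MPI_STATUS_SIZE), ierr, isrc, itag
      INTEGER ibuf(64)
      CALL MPI_INIT(ierr)
C     Accept a request from any slave, with any tag
      CALL MPI_RECV(ibuf, 64, MPI_INTEGER, MPI_ANY_SOURCE,
     &              MPI_ANY_TAG, MPI_COMM_WORLD, status, ierr)
C     Identify which slave asked and what kind of request it was
      isrc = status(MPI_SOURCE)
      itag = status(MPI_TAG)
C     The reply is addressed to that specific slave with the same tag
      CALL MPI_SEND(ibuf, 64, MPI_INTEGER, isrc, itag,
     &              MPI_COMM_WORLD, ierr)
      CALL MPI_FINALIZE(ierr)
      END

The important point is that the reply always goes back to
status(MPI_SOURCE), so the wildcards only affect which pending request
the master picks up next.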

We have run the code through valgrind for a number of cases, including
the one being used here.

The code is Fortran 90 and we are using the Fortran 77 MPI interface,
where all MPI handles are INTEGERs, so I do not believe the
integer-versus-pointer handle issue applies here.
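To illustrate what I mean (a generic fragment, not our code): with the
Fortran bindings every MPI handle is a plain INTEGER, so the difference
between integer and pointer handles in the C library never reaches us.

      INCLUDE 'mpif.h'
C     Handles such as communicators are INTEGERs in Fortran,
C     regardless of whether the C library underneath uses ints
C     (MPICH, Intel) or pointers (Open MPI, LAM).
      INTEGER icomm, irank, ierr
      icomm = MPI_COMM_WORLD
      CALL MPI_COMM_RANK(icomm, irank, ierr)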

We are using Gigabit Ethernet. 

I could look at LAM again to see if it would work. The code needs to be
started in a specific working directory and needs certain environment
variables set, which was not well supported in pre-MPI-2 versions of
MPI. For MPICH1 I actually launch a script for the slaves so that we
have the proper setup before running the executable. Note that I had
tried that with Open MPI and it hit an internal error in orterun, but
this is not a problem since mpirun can set up everything we need (see
the example below). If you think it is worthwhile I will download LAM
and try it.
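For example, mpirun's -x and -wdir options already cover our setup
directly; something along these lines (the working directory here is
just a placeholder, not our real path):

  mpirun -x PVMTASK -x BCFD_PS_MODE -wdir /path/to/workdir -np 1 ./__bcfdbeta.exe

The -x flags export the environment variables to the slaves and -wdir
starts them in the right directory, so no wrapper script is needed.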

-----Original Message-----
From: Jeff Squyres [mailto:jsquy...@cisco.com] 
Sent: Monday, January 29, 2007 7:54 PM
To: Open MPI Users
Subject: Re: [OMPI users] Scrambled communications using ssh starter on
multiple nodes.

Without analyzing your source, it's hard to say.  I will say that OMPI
may send fragments out of order, but we do, of course, provide the same
message ordering guarantees that MPI mandates.  So let me ask a few
leading questions:

- Are you using any wildcards in your receives, such as MPI_ANY_SOURCE
or MPI_ANY_TAG?

- Have you run your code through a memory-checking debugger such as
valgrind?

- I don't know what Scali MPI uses, but MPICH and Intel MPI use integers
for MPI handles.  Have you tried LAM/MPI as well?  It, like Open MPI,
uses pointers for MPI handles.  I mention this because apps that
unintentionally have code that takes advantage of integer handles can
sometimes behave unpredictably when switching to a pointer-based MPI
implementation.

- What network interconnect are you using between the two hosts?



On Jan 25, 2007, at 4:22 PM, Fisher, Mark S wrote:

> Recently I wanted to try OpenMPI for use with our CFD flow solver 
> WINDUS. The code uses a master/slave methodology where the master 
> handles I/O and issues tasks for the slaves to perform. The original 
> parallel implementation was done in 1993 using PVM and in 1999 we 
> added support for MPI.
>
> When testing the code with Openmpi 1.1.2 it ran fine when running on a
> single machine. As soon as I ran on more than one machine I started 
> getting random errors right away (the attached tar ball has a good and
> bad output). It looked like either the messages were out of order or 
> were for the other slave process. In the run mode used there is no 
> slave to slave communication. In the file the code died near the 
> beginning of the communication between master and slave. Sometimes it 
> will run further before it fails.
>
> I have included a tar file with the build and configuration info. The 
> two nodes are identical Xeon 2.8 GHz machines running SLED 10. I am 
> running real-time (no queue) using the ssh starter with the following
> appfile.
>
> -x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host skipper2 -wdir /opt/scratch/m209290/ol.scr.16348 -np 1 ./__bcfdbeta.exe
> -x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host copland -wdir /tmp/mpi.m209290 -np 2 ./__bcfdbeta.exe
>
> The above file fails but the following works:
>
> -x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host skipper2 -wdir /opt/scratch/m209290/ol.scr.16348 -np 1 ./__bcfdbeta.exe
> -x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host skipper2 -wdir /tmp/mpi.m209290 -np 2 ./__bcfdbeta.exe
>
> The first process is the master and the second two are the slaves. I am
> not sure what is going wrong; the code runs fine with many other MPI 
> distributions (MPICH1/2, Intel, Scali...). I assume that either I 
> built it wrong or am not running it properly, but I cannot see what I 
> am doing wrong. Any help would be appreciated!
>
>  <<mpipb.tgz>>


--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
