On Jan 30, 2007, at 9:35 AM, Fisher, Mark S wrote:
The master process uses both MPI_ANY_SOURCE and MPI_ANY_TAG while
waiting for requests from slave processes. The slaves sometimes use
MPI_ANY_TAG but the source is always specified.
I think you said that you only had corruption issues on the slave,
right? If so, the ANY_SOURCE/ANY_TAG on the master probably aren't
the issue.
But if you're doing ANY_TAG on the slaves, you might want to double
check that that code is doing exactly what you think it's doing. Are
there any race conditions such that a message could be received on
that ANY_TAG that you did not intend to receive there? Look
especially hard at non-blocking receives with ANY_TAG.
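For example, here is the kind of (completely made-up) pattern I'm talking about -- none of these names, tags, or buffer sizes come from WINDUS, it's just a sketch of the race:

  ! Sketch of a wildcard race on the slave side (illustrative only).
  subroutine slave_recv_example(master)
    implicit none
    include 'mpif.h'
    integer, intent(in) :: master
    integer :: ierr, req_ctl, req_task
    integer :: status(MPI_STATUS_SIZE)
    integer :: ctlbuf(4)
    double precision :: taskbuf(1000)

    ! A wildcard receive posted "for control messages"...
    call MPI_IRECV(ctlbuf, 4, MPI_INTEGER, master, MPI_ANY_TAG, &
                   MPI_COMM_WORLD, req_ctl, ierr)
    ! ...and a specific-tag receive for task data from the same master.
    call MPI_IRECV(taskbuf, 1000, MPI_DOUBLE_PRECISION, master, 100, &
                   MPI_COMM_WORLD, req_task, ierr)

    ! Because the wildcard receive was posted first and matches *any* tag
    ! from the master, it can consume the tag-100 task message (with a
    ! type/size mismatch) before the second receive ever sees it.
    call MPI_WAIT(req_ctl, status, ierr)
    call MPI_WAIT(req_task, status, ierr)
  end subroutine slave_recv_example

If anything like that can happen in your slave -- a wildcard receive outstanding at the same time as a more specific one from the same peer -- that's where I'd look first.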
We have run the code through valgrind for a number of cases, including the one being used here.
Excellent.
The code is Fortran 90 and we are using the FORTRAN 77 interface, so I do not believe this is a problem.
Agreed; should not be an issue.
We are using Gigabit Ethernet.
Ok, good.
I could look at LAM again to see if it would work. The code needs to be in a specific working directory and we need some environment variables set. This was not supported well in pre-MPI-2 versions of MPI. For MPICH1 I actually launch a script for the slaves so that we have the proper setup before running the executable. Note I had tried that with OpenMPI and it had an internal error in orterun.
Really? OMPI's mpirun does not depend on the executable being an MPI
application -- indeed, you can "mpirun -np 2 uptime" with no
problem. What problem did you run into here?
This is not a problem since mpirun can set up everything we need. If you think it is worthwhile I will download and try it.
From what you describe, it sounds like order of messaging may be the
issue, not necessarily MPI handle types. So let's hold off on that
one for the moment (although LAM should be pretty straightforward to
try -- you should be able to mpirun scripts with no problems; perhaps
you can try it as a background effort when you have spare cycles /
etc.), and look at your slave code for receiving.
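One cheap way to look: have the slave log the actual source and tag of everything it receives on a wildcard and compare against what it expected. A minimal sketch (the routine name, tag argument, and datatype are invented, not taken from your code):

  subroutine checked_recv(buf, n, master, expected_tag)
    implicit none
    include 'mpif.h'
    integer, intent(in) :: n, master, expected_tag
    double precision, intent(out) :: buf(n)
    integer :: ierr, status(MPI_STATUS_SIZE)

    call MPI_RECV(buf, n, MPI_DOUBLE_PRECISION, master, MPI_ANY_TAG, &
                  MPI_COMM_WORLD, status, ierr)

    ! A mismatch here points directly at the receive that is picking
    ! up a message it was not meant to get.
    if (status(MPI_TAG) /= expected_tag) then
       write(*,*) 'unexpected message: source=', status(MPI_SOURCE), &
                  ' tag=', status(MPI_TAG), ' wanted tag=', expected_tag
       call MPI_ABORT(MPI_COMM_WORLD, 1, ierr)
    end if
  end subroutine checked_recv

Even a handful of WRITE statements like that usually shows quickly whether the "scrambling" is really a mismatched receive.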
-----Original Message-----
From: Jeff Squyres [mailto:jsquy...@cisco.com]
Sent: Monday, January 29, 2007 7:54 PM
To: Open MPI Users
Subject: Re: [OMPI users] Scrambled communications using ssh starter
on multiple nodes.
Without analyzing your source, it's hard to say. I will say that OMPI may send fragments out of order, but we do, of course, provide the same message ordering guarantees that MPI mandates (there is a small sketch of this after the questions below). So let me ask a few leading questions:
- Are you using any wildcards in your receives, such as MPI_ANY_SOURCE
or MPI_ANY_TAG?
- Have you run your code through a memory-checking debugger such as
valgrind?
- I don't know what Scali MPI uses, but MPICH and Intel MPI use
integers
for MPI handles. Have you tried LAM/MPI as well? It, like Open MPI,
uses pointers for MPI handles. I mention this because apps that
unintentionally have code that takes advantage of integer handles can
sometimes behave unpredictably when switching to a pointer-based MPI
implementation.
- What network interconnect are you using between the two hosts?
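To make the ordering point concrete, here's a tiny sketch (ranks, tags, and counts are invented): messages between one sender and one receiver on a given communicator are non-overtaking, but nothing orders messages from *different* senders, so a receive posted with MPI_ANY_SOURCE can see them interleaved in any order and has to consult the status to tell them apart.

  program ordering_sketch
    implicit none
    include 'mpif.h'
    integer :: ierr, rank, i, val
    integer :: status(MPI_STATUS_SIZE)

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    if (rank /= 0) then
       ! Each slave sends two messages; rank 0 is guaranteed to see a
       ! given slave's message 1 before its message 2 (same sender/comm).
       do i = 1, 2
          val = rank*10 + i
          call MPI_SEND(val, 1, MPI_INTEGER, 0, i, MPI_COMM_WORLD, ierr)
       end do
    else
       ! With ANY_SOURCE/ANY_TAG, messages from different slaves may
       ! arrive interleaved in any order; only the status says who/what.
       do i = 1, 4   ! assumes exactly 2 slaves for this sketch
          call MPI_RECV(val, 1, MPI_INTEGER, MPI_ANY_SOURCE, MPI_ANY_TAG, &
                        MPI_COMM_WORLD, status, ierr)
          write(*,*) 'got', val, 'from', status(MPI_SOURCE), &
                     'tag', status(MPI_TAG)
       end do
    end if
    call MPI_FINALIZE(ierr)
  end program ordering_sketch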
On Jan 25, 2007, at 4:22 PM, Fisher, Mark S wrote:
Recently I wanted to try OpenMPI for use with our CFD flow solver
WINDUS. The code uses a master/slave methodology where the master
handles I/O and issues tasks for the slaves to perform. The original
parallel implementation was done in 1993 using PVM and in 1999 we
added support for MPI.
When testing the code with OpenMPI 1.1.2 it ran fine on a single machine. As soon as I ran on more than one machine I started getting random errors right away (the attached tarball has a good and a bad output). It looked like the messages were either out of order or were intended for the other slave process. In the run mode used there is no slave-to-slave communication. In the bad output the code died near the beginning of the communication between master and slave; sometimes it will run further before it fails.
I have included a tar file with the build and configuration info. The two nodes are identical Xeon 2.8 GHz machines running SLED 10. I am running real-time (no queue) using the ssh starter with the following appfile:
-x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host skipper2 -wdir /opt/scratch/m209290/ol.scr.16348 -np 1 ./__bcfdbeta.exe
-x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host copland -wdir /tmp/mpi.m209290 -np 2 ./__bcfdbeta.exe
The above file fails but the following works:
-x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host skipper2 -wdir /opt/scratch/m209290/ol.scr.16348 -np 1 ./__bcfdbeta.exe
-x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host skipper2 -wdir /tmp/mpi.m209290 -np 2 ./__bcfdbeta.exe
The first process is the master and the other two are the slaves. I am not sure what is going wrong; the code runs fine with many other MPI distributions (MPICH1/2, Intel, Scali...). I assume that either I built it wrong or am not running it properly, but I cannot see what I am doing wrong. Any help would be appreciated!
<<mpipb.tgz>>
--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems