Recently I wanted to try Open MPI with our CFD flow solver
WINDUS. The code uses a master/slave methodology where the master handles
I/O and issues tasks for the slaves to perform. The original parallel
implementation was done in 1993 using PVM, and in 1999 we added support
for MPI.
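
For context, the communication pattern is essentially a standard master/slave
task loop. A minimal sketch of that pattern in C is below; the tags, task data,
and computation are placeholders for illustration only, not the actual WINDUS
code:

    #include <mpi.h>
    #include <stdio.h>

    #define TAG_TASK   1
    #define TAG_RESULT 2

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                  /* master: handles I/O, issues tasks */
            for (int slave = 1; slave < size; slave++) {
                int task = slave;         /* placeholder task descriptor */
                MPI_Send(&task, 1, MPI_INT, slave, TAG_TASK, MPI_COMM_WORLD);
            }
            for (int slave = 1; slave < size; slave++) {
                double result;
                MPI_Status status;
                MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_RESULT,
                         MPI_COMM_WORLD, &status);
                printf("result from rank %d: %f\n", status.MPI_SOURCE, result);
            }
        } else {                          /* slave: receives a task, returns a result */
            int task;
            double result;
            MPI_Recv(&task, 1, MPI_INT, 0, TAG_TASK, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            result = 2.0 * task;          /* stand-in for the real computation */
            MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_RESULT, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }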

When testing the code with Open MPI 1.1.2 it ran fine on a
single machine. As soon as I ran on more than one machine I started
getting random errors right away (the attached tarball contains both a good
and a bad output). It looked like the messages were either out of order or
were intended for the other slave process. In the run mode used there is no
slave-to-slave communication. In the bad output the code died near the
beginning of the communication between master and slave; sometimes it runs
further before it fails.
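
One check that could be added on the master side to confirm whether messages
really are misdirected is to compare the MPI_Status fields of each receive
against the expected sender and tag. A sketch of that idea (the function name
and arguments are illustrative, not from the WINDUS source):

    #include <mpi.h>
    #include <stdio.h>

    /* Receive from any source/tag, then warn if the message did not come
       from the rank and tag the caller expected. */
    int checked_recv(void *buf, int count, MPI_Datatype type,
                     int expected_src, int expected_tag, MPI_Comm comm)
    {
        MPI_Status status;
        int rc = MPI_Recv(buf, count, type, MPI_ANY_SOURCE, MPI_ANY_TAG,
                          comm, &status);
        if (status.MPI_SOURCE != expected_src || status.MPI_TAG != expected_tag)
            fprintf(stderr, "unexpected message: source=%d tag=%d (expected %d, %d)\n",
                    status.MPI_SOURCE, status.MPI_TAG, expected_src, expected_tag);
        return rc;
    }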

I have included a tar file with the build and configuration info. The
two nodes are identical Xeon 2.8 GHz machines running SLED 10. I am
running in real time (no queue) with the ssh starter, using the following
appfile:

-x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host
skipper2  -wdir /opt/scratch/m209290/ol.scr.16348 -np 1 ./__bcfdbeta.exe
-x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host
copland -wdir /tmp/mpi.m209290 -np 2 ./__bcfdbeta.exe

The above appfile fails, but the following one works:

-x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host
skipper2  -wdir /opt/scratch/m209290/ol.scr.16348 -np 1 ./__bcfdbeta.exe
-x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host
skipper2 -wdir /tmp/mpi.m209290 -np 2 ./__bcfdbeta.exe
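
(In both cases the appfile is handed to mpirun via its --app option, e.g.
"mpirun --app <appfile>", assuming nothing else unusual on the command line.)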

The first process is the master and the other two are the slaves. I am
not sure what is going wrong; the code runs fine with many other MPI
distributions (MPICH1/2, Intel, Scali, ...). I assume that I either built
it wrong or am not running it properly, but I cannot see what I am doing
wrong. Any help would be appreciated!

 <<mpipb.tgz>> 
