Recently I wanted to try Open MPI with our CFD flow solver WINDUS. The code uses a master/slave methodology where the master handles I/O and issues tasks for the slaves to perform. The original parallel implementation was done in 1993 using PVM, and in 1999 we added support for MPI.
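For reference, the communication pattern looks roughly like the following minimal sketch (this is not the actual WINDUS code; the tags and the integer payload are made up for illustration):

/* Minimal sketch of the master/slave pattern described above -- not the
 * actual WINDUS code; TASK_TAG/RESULT_TAG and the payloads are made up. */
#include <mpi.h>
#include <stdio.h>

#define TASK_TAG   1
#define RESULT_TAG 2

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                      /* master: handles I/O, issues tasks */
        for (int slave = 1; slave < size; slave++) {
            int task = slave;             /* stand-in for a real work item */
            MPI_Send(&task, 1, MPI_INT, slave, TASK_TAG, MPI_COMM_WORLD);
        }
        for (int i = 1; i < size; i++) {
            int result;
            MPI_Status status;
            /* receive from any slave; the source and tag in the status
             * identify who sent what */
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, RESULT_TAG,
                     MPI_COMM_WORLD, &status);
            printf("result %d from rank %d\n", result, status.MPI_SOURCE);
        }
    } else {                              /* slave: receive a task, return a result */
        int task, result;
        MPI_Recv(&task, 1, MPI_INT, 0, TASK_TAG, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        result = task * 2;                /* stand-in for real work */
        MPI_Send(&result, 1, MPI_INT, 0, RESULT_TAG, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

The point is that correct delivery depends entirely on MPI matching each message's source and tag.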
When testing the code with Open MPI 1.1.2 it ran fine on a single machine, but as soon as I ran on more than one machine I started getting random errors right away (the attached tarball contains a good and a bad output). It looked like the messages were either arriving out of order or were intended for the other slave process; in the run mode used here there is no slave-to-slave communication. In the bad output the code died near the beginning of the master/slave communication, although sometimes it runs further before failing. The tarball also contains the build and configuration info. The two nodes are identical 2.8 GHz Xeon machines running SLED 10. I am running interactively (no queue) using the ssh starter, with the following appfile:

-x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host skipper2 -wdir /opt/scratch/m209290/ol.scr.16348 -np 1 ./__bcfdbeta.exe
-x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host copland -wdir /tmp/mpi.m209290 -np 2 ./__bcfdbeta.exe

That appfile fails, but the following one, which keeps all three processes on skipper2, works:

-x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host skipper2 -wdir /opt/scratch/m209290/ol.scr.16348 -np 1 ./__bcfdbeta.exe
-x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host skipper2 -wdir /tmp/mpi.m209290 -np 2 ./__bcfdbeta.exe

The first process is the master and the other two are the slaves. I am not sure what is going wrong; the code runs fine with many other MPI distributions (MPICH1/2, Intel, Scali, ...). I assume that I either built Open MPI incorrectly or am not running it properly, but I cannot see what I am doing wrong. Any help would be appreciated!
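P.S. To clarify what I mean by messages looking out of order or "for the other slave": the data received does not match what our protocol expects from that source and tag at that point. A sanity check along the following lines would catch that at the first bad message (a sketch only; check_recv and the expected source/tag values are hypothetical, not WINDUS routines):

/* Hypothetical receive wrapper that aborts on a misrouted message. */
#include <mpi.h>
#include <stdio.h>

static void check_recv(void *buf, int count, MPI_Datatype type,
                       int expected_src, int expected_tag)
{
    MPI_Status status;
    MPI_Recv(buf, count, type, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, &status);
    if (status.MPI_SOURCE != expected_src || status.MPI_TAG != expected_tag) {
        fprintf(stderr,
                "misrouted message: src=%d tag=%d (expected src=%d tag=%d)\n",
                status.MPI_SOURCE, status.MPI_TAG, expected_src, expected_tag);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
}

Aborting at the first mismatch, instead of letting the run continue with bad data, would pin down exactly where the misdelivery happens.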
Attachment: mpipb.tgz