The master process uses both MPI_ANY_SOURCE and MPI_ANY_TAG while waiting for requests from the slave processes. The slaves sometimes use MPI_ANY_TAG, but the source is always specified.
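To make that concrete, here is a minimal sketch of what those receives look like through the Fortran 77 interface (the buffer type and size are placeholders and the master is assumed to be rank 0; this is not code lifted from WINDUS):

      subroutine master_wait(buf, maxbuf, islave, itag)
c     Master side: accept a request from any slave, with any tag.
      include 'mpif.h'
      integer maxbuf, islave, itag, ierr
      integer status(MPI_STATUS_SIZE)
      double precision buf(maxbuf)
      call MPI_RECV(buf, maxbuf, MPI_DOUBLE_PRECISION,
     &              MPI_ANY_SOURCE, MPI_ANY_TAG,
     &              MPI_COMM_WORLD, status, ierr)
c     Recover the actual sender and tag from the status array.
      islave = status(MPI_SOURCE)
      itag   = status(MPI_TAG)
      return
      end

      subroutine slave_wait(buf, maxbuf, itag)
c     Slave side: the tag may be wild, but the source is always given
c     explicitly (assumed here to be the master at rank 0).
      include 'mpif.h'
      integer maxbuf, itag, ierr
      integer status(MPI_STATUS_SIZE)
      double precision buf(maxbuf)
      call MPI_RECV(buf, maxbuf, MPI_DOUBLE_PRECISION,
     &              0, MPI_ANY_TAG, MPI_COMM_WORLD, status, ierr)
      itag = status(MPI_TAG)
      return
      end

The master then dispatches based on the source and tag it pulls out of the status array.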
We have run the code through valgrind for a number of cases, including the one being used here.

The code is Fortran 90 and we are using the Fortran 77 interface, so I do not believe that is a problem.

We are using Gigabit Ethernet.

I could look at LAM again to see if it would work. The code needs to be in a specific working directory and we need some environment variables set, which was not supported well in pre-MPI-2 versions of MPI. For MPICH1 I actually launch a script for the slaves so that we have the proper setup before running the executable. Note that I had tried that with Open MPI and it hit an internal error in orterun; that is not a problem, since mpirun can set up everything we need. If you think it is worthwhile I will download LAM and try it.

-----Original Message-----
From: Jeff Squyres [mailto:jsquy...@cisco.com]
Sent: Monday, January 29, 2007 7:54 PM
To: Open MPI Users
Subject: Re: [OMPI users] Scrambled communications using ssh starter on multiple nodes.

Without analyzing your source, it's hard to say. I will say that OMPI may send fragments out of order, but we do, of course, provide the same message-ordering guarantees that MPI mandates. So let me ask a few leading questions:

- Are you using any wildcards in your receives, such as MPI_ANY_SOURCE or MPI_ANY_TAG?

- Have you run your code through a memory-checking debugger such as valgrind?

- I don't know what Scali MPI uses, but MPICH and Intel MPI use integers for MPI handles. Have you tried LAM/MPI as well? It, like Open MPI, uses pointers for MPI handles. I mention this because apps that unintentionally have code that takes advantage of integer handles can sometimes behave unpredictably when switching to a pointer-based MPI implementation.

- What network interconnect are you using between the two hosts?

On Jan 25, 2007, at 4:22 PM, Fisher, Mark S wrote:

> Recently I wanted to try Open MPI for use with our CFD flow solver
> WINDUS. The code uses a master/slave methodology where the master
> handles I/O and issues tasks for the slaves to perform. The original
> parallel implementation was done in 1993 using PVM, and in 1999 we
> added support for MPI.
>
> When testing the code with Open MPI 1.1.2 it ran fine on a single
> machine. As soon as I ran on more than one machine I started getting
> random errors right away (the attached tar ball has a good and a bad
> output). It looked like either the messages were out of order or were
> intended for the other slave process. In the run mode used there is no
> slave-to-slave communication. In the attached case the code died near
> the beginning of the communication between master and slave; sometimes
> it will run further before it fails.
>
> I have included a tar file with the build and configuration info. The
> two nodes are identical Xeon 2.8 GHz machines running SLED 10. I am
> running real-time (no queue) with the ssh starter, using the following
> appfile:
>
> -x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host skipper2 -wdir /opt/scratch/m209290/ol.scr.16348 -np 1 ./__bcfdbeta.exe
> -x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host copland -wdir /tmp/mpi.m209290 -np 2 ./__bcfdbeta.exe
>
> The above file fails, but the following works:
>
> -x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host skipper2 -wdir /opt/scratch/m209290/ol.scr.16348 -np 1 ./__bcfdbeta.exe
> -x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host skipper2 -wdir /tmp/mpi.m209290 -np 2 ./__bcfdbeta.exe
>
> The first process is the master and the second two are the slaves. I am
> not sure what is going wrong; the code runs fine with many other MPI
> distributions (MPICH1/2, Intel, Scali...). I assume that either I built
> it wrong or am not running it properly, but I cannot see what I am
> doing wrong. Any help would be appreciated!
>
> <mpipb.tgz>

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users