John R. Cary wrote:

Jeff Squyres wrote:

(for the web archives)

Brock and I talked about this .f90 code a bit off list -- he's going to investigate with the test author a bit more because both of us are a bit confused by the F90 array syntax used.

Attached is a simple send/recv code, written in (procedural) C++, that
illustrates a similar problem. It dies after a random number of iterations
with openmpi-1.3.2 or 1.3.3. (I have submitted this before.) On some machines
the hang goes away with "-mca btl_sm_num_fifos 8" or
"-mca btl ^sm", so I think this is
https://svn.open-mpi.org/trac/ompi/ticket/2043.
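
Concretely, with the run line from the reproducer's header, the two workarounds look like this (adjust -n to the core count you're testing; quote the caret if your shell treats it specially):

mpirun -n 3 -mca btl_sm_num_fifos 8 ompi1.3.3-bug
mpirun -n 3 -mca btl ^sm ompi1.3.3-bug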

I suppose so. GCC 4.4.0. We've made a bit of progress on this recently, but again I don't know how much further we have to go. I posted a C-only stand-alone example to the ticket, but would appreciate anyone jumping in and looking at it further. George has taken a peek so far.

Since it has barriers after each send/recv pair, I do not understand how any buffers could fill up.

Right. For 2043, it seems there is a race condition when two processes write to the same on-node receiver. It's possible to observe the problem with nothing but repeated barriers.
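
To make that concrete, a barrier-only reproducer needs nothing more than a loop like the sketch below (my illustration, not the actual test attached to the ticket):

#include <mpi.h>
#include <iostream>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rk;
  MPI_Comm_rank(MPI_COMM_WORLD, &rk);
  // Repeated barriers are enough: every barrier makes multiple on-node
  // processes write to the same receiver's shared-memory FIFO.
  for (unsigned t = 1; ; ++t) {
    MPI_Barrier(MPI_COMM_WORLD);
    if (rk == 0 && t % 1000 == 0)
      std::cout << t << " barriers completed." << std::endl;
  }
  MPI_Finalize(); // never reached; the loop runs until the hang appears
}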

On Dec 1, 2009, at 10:46 AM, Brock Palen wrote:

The attached code is an example where openmpi/1.3.2 will lock up if
run on 48 cores over IB (4 cores per node).
The code loops, with rank 0 receiving from all other processors and
every other rank sending to rank 0. As far as I know this should work,
and I can't see why not.

Okay. Presumably the IB part is irrelevant. Just having one node with multiple senders sending to a common receiver should do the job.

Note that if I increase the openib eager limit, the program runs,
which normally indicates incorrect MPI usage, but I can't figure out
the problem with this code on my own.

This conflicts with the theory that it's trac 2043. Similarly, the fact that message size matters here *suggests* (but does not prove) that the problem is something else.
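
For reference, the eager-limit experiment would presumably be run along these lines (assuming the parameter in question is btl_openib_eager_limit, whose value is in bytes; the binary name here is a placeholder):

mpirun -n 48 -mca btl_openib_eager_limit 65536 ./a.out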

/**
* A simple test program to demonstrate a problem in OpenMPI 1.3
*
* Make with:
* mpicxx -o ompi1.3.3-bug ompi1.3.3-bug.cxx
*
* Run with:
* mpirun -n 3 ompi1.3.3-bug
*/

// mpi includes
#include <mpi.h>

// std includes
#include <iostream>
#include <vector>

// useful hashdefine
#define ARRAY_SIZE 250

/**
* Main driver
*/
int main(int argc, char** argv) {
  // Initialize MPI
  MPI_Init(&argc, &argv);

  int rk, sz;
  MPI_Comm_rank(MPI_COMM_WORLD, &rk);
  MPI_Comm_size(MPI_COMM_WORLD, &sz);

  // Create some data to pass around
  std::vector<double> d(ARRAY_SIZE);

  // Initialize to some values if we aren't rank 0
  if ( rk )
    for ( unsigned i = 0; i < ARRAY_SIZE; ++i )
      d[i] = 2*i + 1;

  // Loop until this breaks: every nonzero rank sends its array to rank 0,
  // rank 0 receives from each sender in turn, then everyone barriers.
  unsigned t = 0;
  while ( 1 ) {
    MPI_Status s;
    if ( rk )
      MPI_Send( &d[0], static_cast<int>(d.size()), MPI_DOUBLE, 0, 3, MPI_COMM_WORLD );
    else
      for ( int i = 1; i < sz; ++i )
        MPI_Recv( &d[0], static_cast<int>(d.size()), MPI_DOUBLE, i, 3, MPI_COMM_WORLD, &s );
    MPI_Barrier(MPI_COMM_WORLD);
    std::cout << "Transmission " << ++t << " completed." << std::endl;
  }

  // Finalize MPI (never reached; the loop above runs until the program hangs)
  MPI_Finalize();
}
