Jeff Squyres wrote:
(for the web archives)

Brock and I talked about this .f90 code a bit off list -- he's going to investigate with the test author a bit more because both of us are a bit confused by the F90 array syntax used.
Attached is a simple send/recv code written in (procedural) C++ that
illustrates a similar problem.  It dies after a random number of iterations
with openmpi-1.3.2 or 1.3.3. (I have submitted this before.) On some machines
the hang goes away with "-mca btl_sm_num_fifos 8" or
"-mca btl ^sm", so I think this is
https://svn.open-mpi.org/trac/ompi/ticket/2043.
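
For reference, those workarounds would be passed to mpirun roughly like this
(the process count is just an example; the binary name is the one from the
comment block in the attached source):

mpirun -n 3 -mca btl_sm_num_fifos 8 ompi1.3.3-bug
mpirun -n 3 -mca btl ^sm ompi1.3.3-bug

The second form excludes the shared-memory (sm) BTL entirely, so if only that
one helps, the sm transport is the likely suspect.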


Since it has barriers after each send/recv pair, I do not understand how
any buffers could fill up.
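
One way to probe the buffering hypothesis (just a suggestion, not something the
attached test does) is to switch the standard sends to synchronous sends:
MPI_Ssend does not complete until the matching receive has started, so the test
no longer depends on eager buffering at the receiver. If this variant still
hangs, buffering alone does not explain it. A minimal self-contained sketch of
that variant (same shape as the attached program, array size picked arbitrarily):

// Variant of the attached test using synchronous sends (sketch only).
#include <mpi.h>
#include <iostream>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rk, sz;
  MPI_Comm_rank(MPI_COMM_WORLD, &rk);
  MPI_Comm_size(MPI_COMM_WORLD, &sz);

  std::vector<double> d(250, 1.0);   // arbitrary payload
  unsigned t = 0;
  while ( 1 ) {                      // runs until it hangs or is killed
    MPI_Status s;
    if ( rk )
      // Synchronous send: completes only after the matching receive
      // has started, so it does not rely on the eager path.
      MPI_Ssend( &d[0], d.size(), MPI_DOUBLE, 0, 3, MPI_COMM_WORLD );
    else
      for ( int i = 1; i < sz; ++i )
        MPI_Recv( &d[0], d.size(), MPI_DOUBLE, i, 3, MPI_COMM_WORLD, &s );
    MPI_Barrier(MPI_COMM_WORLD);
    std::cout << "Transmission " << ++t << " completed." << std::endl;
  }
  MPI_Finalize();
}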

Various stats:

iter.cary$ uname -a
Linux iter.txcorp.com 2.6.29.4-167.fc11.x86_64 #1 SMP Wed May 27 17:27:08 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
iter.cary$ g++ --version
g++ (GCC) 4.4.0 20090506 (Red Hat 4.4.0-4)
Copyright (C) 2009 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

iter.cary$ mpicxx -show
g++ -I/usr/local/openmpi-1.3.2-nodlopen/include -pthread -L/usr/local/torque-2.4.0b1/lib -Wl,--rpath -Wl,/usr/local/torque-2.4.0b1/lib -Wl,-rpath,/usr/local/openmpi-1.3.2-nodlopen/lib -L/usr/local/openmpi-1.3.2-nodlopen/lib -lmpi_cxx -lmpi -lopen-rte -lopen-pal -ltorque -ldl -lnsl -lutil -lm

I have seen failures on 64-bit hardware only.


John Cary



On Dec 1, 2009, at 10:46 AM, Brock Palen wrote:

The attached code is an example where openmpi/1.3.2 will lock up if
run on 48 cores over IB (4 cores per node).
The code loops over receives from all processors on rank 0 and sends from
all other ranks. As far as I know this should work, and I can't see
why not.
Yes, I know we could do the same thing with a gather; this is just a
simple case to demonstrate the issue.
Note that if I increase the openib eager limit, the program runs,
which normally means improper MPI, but I can't figure out on my own
what is wrong with this code.

Any input on why code like this locks up unless we raise the eager
buffer would be helpful. We should not be having to increase the buffer
size just to make code run; doing so makes me feel hacky and dirty.


<sendbuf.f90>
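
Regarding the eager-limit question quoted above: one restructuring that removes
the dependence on the eager limit is to pre-post non-blocking receives on rank 0
before the sends arrive, so no message has to sit in eager/unexpected buffers.
This is only a C++ sketch of that pattern (the attached test is Fortran, and the
message size here is arbitrary), not a claim about what the right fix is:

#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rk, sz;
  MPI_Comm_rank(MPI_COMM_WORLD, &rk);
  MPI_Comm_size(MPI_COMM_WORLD, &sz);

  const int n = 250;                      // arbitrary per-rank message size
  std::vector<double> sendbuf(n, 1.0);

  if ( rk == 0 && sz > 1 ) {
    // Pre-post one receive per sender, then wait for all of them.
    // Matching receives already exist when the sends arrive.
    std::vector<double> recvbuf((sz - 1) * n);
    std::vector<MPI_Request> reqs(sz - 1);
    for ( int i = 1; i < sz; ++i )
      MPI_Irecv( &recvbuf[(i - 1) * n], n, MPI_DOUBLE, i, 3,
                 MPI_COMM_WORLD, &reqs[i - 1] );
    MPI_Waitall( sz - 1, &reqs[0], MPI_STATUSES_IGNORE );
  } else if ( rk != 0 ) {
    MPI_Send( &sendbuf[0], n, MPI_DOUBLE, 0, 3, MPI_COMM_WORLD );
  }

  MPI_Finalize();
}

With the receives already posted, the transport can deliver each message
straight into its final buffer, so the receiver never has to hold messages in a
bounded eager/unexpected queue.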



/**
 * A simple test program to demonstrate a problem in OpenMPI 1.3
 *
 * Make with:
 * mpicxx -o ompi1.3.3-bug ompi1.3.3-bug.cxx
 *
 * Run with:
 * mpirun -n 3 ompi1.3.3-bug
 */

// mpi includes
#include <mpi.h>

// std includes
#include <iostream>
#include <vector>

// useful hashdefine
#define ARRAY_SIZE 250

/**
 * Main driver
 */
int main(int argc, char** argv) {
  // Initialize MPI
  MPI_Init(&argc, &argv);

  int rk, sz;
  MPI_Comm_rank(MPI_COMM_WORLD, &rk);
  MPI_Comm_size(MPI_COMM_WORLD, &sz);

  // Create some data to pass around
  std::vector<double> d(ARRAY_SIZE);

  // Initialize to some values if we aren't rank 0
  if ( rk )
    for ( unsigned i = 0; i < ARRAY_SIZE; ++i )
      d[i] = 2*i + 1;

  // Loop until this breaks
  unsigned t = 0;
  while ( 1 ) {
    MPI_Status s;
    if ( rk )
      MPI_Send( &d[0], d.size(), MPI_DOUBLE, 0, 3, MPI_COMM_WORLD );
    else
      for ( int i = 1; i < sz; ++i )
        MPI_Recv( &d[0], d.size(), MPI_DOUBLE, i, 3, MPI_COMM_WORLD, &s );
    MPI_Barrier(MPI_COMM_WORLD);
    std::cout << "Transmission " << ++t << " completed." << std::endl;
  }

  // Finalize MPI
  MPI_Finalize();
}
