Guillaume THOMAS-COLLIGNON wrote:
Hi,

I wrote an application which works fine on a small number of nodes (e.g. 4), but it crashes on a large number of CPUs.

In this application, all the slaves send many small messages to the master. I use the regular MPI_Send, and since the messages are relatively small (1 int, then many messages of 3296 ints each), OpenMPI does a very good job at sending them asynchronously, and it maxes out the gigabit link on the master node. I'm very happy with this behaviour: it gives me the same performance as if I was doing all the asynchronous stuff myself, and the code remains simple.
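(Just to illustrate what I mean by "doing the asynchronous stuff myself": on the slave side it would be something roughly like the sketch below, using MPI_Isend and a final MPI_Waitall. This is not code from my application; the helper name and the one-buffer-per-block layout are only assumptions for the example. The real code simply calls MPI_Send in a loop, as shown in the listing at the end.)

#include <stdlib.h>
#include <mpi.h>

#define BLOCKSIZE 3296

/* Illustration only: explicit non-blocking sends on the slave side.
   Each of the nblocks blocks is assumed to live in its own slice of
   'blocks' (nblocks * BLOCKSIZE ints).  Tag 993 matches the listing below. */
static int send_blocks_nonblocking (int *blocks, int nblocks)
{
   int j, ier;
   MPI_Request *reqs = malloc (nblocks * sizeof (MPI_Request));

   if (reqs == NULL)
      return -1;

   /* Post all the sends without waiting for each one to complete */
   for (j = 0; j < nblocks; j++) {
      ier = MPI_Isend (blocks + (size_t) j * BLOCKSIZE, BLOCKSIZE, MPI_INT,
                       0, 993, MPI_COMM_WORLD, &reqs[j]);
      if (ier != MPI_SUCCESS) {
         free (reqs);
         return ier;
      }
   }

   /* Wait for all outstanding sends to finish before reusing the buffers */
   ier = MPI_Waitall (nblocks, reqs, MPI_STATUSES_IGNORE);
   free (reqs);
   return ier;
}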

But it crashes when there are too many slaves.
How many is too many? I successfully ran your code on 96 nodes with 4 processes per node, and it seemed to work fine. Also, what network are you using?

So it looks like at some point the master node runs out of buffers and the job crashes brutally.
What do you mean by crashing? Is there a segfault or an error message?

Tim

That's my understanding, but I may be wrong.
If I use explicit synchronous sends (MPI_Ssend), it does not crash anymore, but the performance is a lot lower.

I have two questions regarding this:

1) What kind of tuning would help handle more messages and keep the master from crashing?

2) Is this the expected behaviour? I don't think my code is doing anything wrong, so I would not expect a brutal crash.


The workaround I've found so far is to do an MPI_Ssend for the request, then use MPI_Send for the data blocks. All the slaves are blocked on the request, which keeps the master from being flooded, and the performance is still good. But nothing tells me it won't crash at some point if I have more data blocks in my real code, so I'd like to know more about what's happening here.
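To make it concrete, the slave side of this workaround looks roughly like the sketch below (illustration only, the helper name is just for the example; tags 964/993 and BLOCKSIZE are the same as in the full listing further down):

#include <mpi.h>

#define BLOCKSIZE 3296

/* Sketch of the workaround: the request goes out with MPI_Ssend, so the
   slave blocks until the master has actually matched it, and only then
   streams the data blocks with the regular MPI_Send. */
static int send_blocks_throttled (int *data, int nblocks)
{
   int j, ier;

   /* Synchronous send: completes only once the master posts the matching
      receive, which keeps the slaves from flooding the master */
   ier = MPI_Ssend (&nblocks, 1, MPI_INT, 0, 964, MPI_COMM_WORLD);
   if (ier != MPI_SUCCESS)
      return ier;

   /* The data blocks still use the regular MPI_Send */
   for (j = 0; j < nblocks; j++) {
      ier = MPI_Send (data, BLOCKSIZE, MPI_INT, 0, 993, MPI_COMM_WORLD);
      if (ier != MPI_SUCCESS)
         return ier;
   }
   return MPI_SUCCESS;
}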

Thanks,

        -Guillaume


Here is the code, so you get a better idea of the communication scheme, or in case someone wants to reproduce the problem.


#include <stdio.h>
#include <stdlib.h>

#include <mpi.h>

#define BLOCKSIZE 3296
#define MAXBLOCKS 1000
#define NLOOP 4

int main (int argc, char **argv) {
   int i, j, ier, rank, npes, slave, request;
   int *data;
   MPI_Status status;

   MPI_Init (&argc, &argv);
   MPI_Comm_rank (MPI_COMM_WORLD, &rank);
   MPI_Comm_size (MPI_COMM_WORLD, &npes);

   if ((data = (int *) calloc (BLOCKSIZE, sizeof (int))) == NULL)
     return -10;

   // Master
   if (rank == 0) {
      // Expect (NLOOP * number of slaves) requests
      for (i = 0; i < (npes - 1) * NLOOP; i++) {
         /* Wait for a request from any slave. The request contains the
            number of data blocks that slave wants to send. */
         ier = MPI_Recv (&request, 1, MPI_INT, MPI_ANY_SOURCE, 964,
                         MPI_COMM_WORLD, &status);
         if (ier != MPI_SUCCESS)
            return -1;
         slave = status.MPI_SOURCE;
         printf ("Master : request for %d blocks from slave %d\n",
                 request, slave);

         /* Receive the data blocks from this slave */
         for (j = 0; j < request; j++) {
            ier = MPI_Recv (data, BLOCKSIZE, MPI_INT, slave, 993,
                            MPI_COMM_WORLD, &status);
            if (ier != MPI_SUCCESS)
               return -2;
         }
      }
   }
   // Slaves
   else {
      for (i = 0; i < NLOOP; i++) {
         /* Send the request = number of blocks we want to send to the master */
         request = MAXBLOCKS;
         /* Changing this MPI_Send to MPI_Ssend is enough to keep the master
            from being flooded */
         ier = MPI_Send (&request, 1, MPI_INT, 0, 964, MPI_COMM_WORLD);
         if (ier != MPI_SUCCESS)
            return -3;
         /* Send the data blocks */
         for (j = 0; j < request; j++) {
            ier = MPI_Send (data, BLOCKSIZE, MPI_INT, 0, 993, MPI_COMM_WORLD);
            if (ier != MPI_SUCCESS)
               return -4;
         }
      }
   }
   printf ("Node %d done\n", rank);
   MPI_Finalize ();
   return 0;
}

