Guillaume THOMAS-COLLIGNON wrote:
Hi,
I wrote an application which works fine on a small number of nodes
(e.g. 4), but it crashes on a large number of CPUs.
In this application, all the slaves send many small messages to the
master. I use the regular MPI_Send, and since the messages are
relatively small (1 int, then many times 3296 ints), OpenMPI does a
very good job at sending them asynchronously, and it maxes out the
gigabit link on the master node. I'm very happy with this behaviour:
it gives me the same performance as if I were doing all the
asynchronous stuff myself, and the code remains simple.
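(For comparison, "doing the asynchronous stuff myself" would look roughly
like the hypothetical helper below. This is only a sketch, not part of the
test case further down: it posts one MPI_Isend per block and then waits for
all of them, and it assumes each block lives in its own buffer, since a
buffer handed to MPI_Isend must not be reused until its request completes.)

#include <mpi.h>
#include <stdlib.h>

/* Hypothetical helper (sketch): send nblocks blocks of blocksize ints
   each with nonblocking sends, then wait for all of them to complete. */
static int send_blocks_async (int **blocks, int nblocks, int blocksize,
                              int dest, int tag, MPI_Comm comm)
{
  MPI_Request *reqs = (MPI_Request *) malloc (nblocks * sizeof (MPI_Request));
  int i, ier;

  if (reqs == NULL)
    return MPI_ERR_NO_MEM;
  /* Post all the sends without waiting */
  for (i = 0; i < nblocks; i++)
    MPI_Isend (blocks[i], blocksize, MPI_INT, dest, tag, comm, &reqs[i]);
  /* Block until every posted send has completed */
  ier = MPI_Waitall (nblocks, reqs, MPI_STATUSES_IGNORE);
  free (reqs);
  return ier;
}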
But it crashes when there are too many slaves.
How many is too many? I successfully ran your code on 96 nodes with 4
processes per node, and it seemed to work fine. Also, what network are
you using?
So it looks like at
some point the master node runs out of buffers and the job crashes
brutally.
What do you mean by crashing? Is there a segfault or an error message?
Tim
That's my understanding but I may be wrong.
If I use explicit synchronous sends (MPI_Ssend), it does not crash
anymore, but the performance is a lot lower.
I have two questions regarding this:
1) What kind of tuning would help handle more messages and keep the
master from crashing?
2) Is this the expected behaviour? I don't think my code is doing
anything wrong, so I would not expect a brutal crash.
The workaround I've found so far is to do an MPI_Ssend for the
request, then use MPI_Send for the data blocks. All the slaves are
then blocked on the request, which keeps the master from being
flooded, and the performance is still good. But nothing tells me it
won't crash at some point if I have more data blocks in my real code,
so I'd like to know more about what's happening here.
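Concretely, the workaround only touches the request send in the slave loop
of the code below (same tag and arguments as in the reproducer):

  /* Workaround: make the request synchronous, so each slave blocks until
     the master has actually posted the matching receive; the data blocks
     that follow still go out with plain MPI_Send. */
  ier = MPI_Ssend (&request, 1, MPI_INT, 0, 964, MPI_COMM_WORLD);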
Thanks,
-Guillaume
Here is the code, so you get a better idea of the communication
scheme, or in case someone wants to reproduce the problem.
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define BLOCKSIZE 3296
#define MAXBLOCKS 1000
#define NLOOP 4
int main (int argc, char **argv) {
  int i, j, ier, rank, npes, slave, request;
  int *data;
  MPI_Status status;

  MPI_Init (&argc, &argv);
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);
  MPI_Comm_size (MPI_COMM_WORLD, &npes);

  if ((data = (int *) calloc (BLOCKSIZE, sizeof (int))) == NULL)
    return -10;

  // Master
  if (rank == 0) {
    // Expect (NLOOP * number of slaves) requests
    for (i=0; i<(npes-1)*NLOOP; i++) {
      /* Wait for a request from any slave. Request contains the number
         of data blocks */
      ier = MPI_Recv (&request, 1, MPI_INT, MPI_ANY_SOURCE, 964,
                      MPI_COMM_WORLD, &status);
      if (ier != MPI_SUCCESS)
        return -1;
      slave = status.MPI_SOURCE;
      printf ("Master : request for %d blocks from slave %d\n",
              request, slave);
      /* Receive the data blocks from this slave */
      for (j=0; j<request; j++) {
        ier = MPI_Recv (data, BLOCKSIZE, MPI_INT, slave, 993,
                        MPI_COMM_WORLD, &status);
        if (ier != MPI_SUCCESS)
          return -2;
      }
    }
  }
  // Slaves
  else {
    for (i=0; i<NLOOP; i++) {
      /* Send the request = number of blocks we want to send to the
         master */
      request = MAXBLOCKS;
      /* Changing this MPI_Send to MPI_Ssend is enough to keep the master
         from being flooded */
      ier = MPI_Send (&request, 1, MPI_INT, 0, 964, MPI_COMM_WORLD);
      if (ier != MPI_SUCCESS)
        return -3;
      /* Send the data blocks */
      for (j=0; j<request; j++) {
        ier = MPI_Send (data, BLOCKSIZE, MPI_INT, 0, 993, MPI_COMM_WORLD);
        if (ier != MPI_SUCCESS)
          return -4;
      }
    }
  }

  printf ("Node %d done\n", rank);
  free (data);
  MPI_Finalize ();
  return 0;
}
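To build and run the reproducer with Open MPI, something along these lines
should do (the file name and process count are arbitrary):

  mpicc flood.c -o flood
  mpirun -np <number_of_processes> ./flood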