Let's start with this: you post non-blocking sends (MPI_Isend), but those
sends are never completed anywhere. So, strictly speaking, they do not
even have to be carried out. In practice, even if they are carried out,
each request still has to be "completed" from the user program's point of
view (MPI_Test, MPI_Wait, MPI_Waitall, etc.) so the library can reclaim
the resources associated with it.
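The usual idiom is roughly this (a minimal sketch with placeholder names
like dest, tag, and comm, not your actual code):

  int value = 42;
  MPI_Request req;
  /* post the non-blocking send */
  MPI_Isend(&value, 1, MPI_INT, dest, tag, comm, &req);
  /* ... do other work, e.g. poll for incoming messages ... */
  /* complete the send so the library can release the request */
  MPI_Wait(&req, MPI_STATUS_IGNORE);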
So, that is the first thing to fix. The question is where those send
requests should be completed. I think there are several steps you could
take here to get what you're looking for:
1) Implement a version that works without worrying about the "sleep"
behavior. In your case, you're sending messages in an all-to-all
pattern. So, for example, you could post an MPI_Irecv for each non-self
process, then post an MPI_Isend for each non-self process, and then use
MPI_Wait (or MPI_Waitall) to complete all of those requests. A rough
sketch follows after this list.
2) (Optional step): consider alternative message patterns, like trees,
to cut down on all the message traffic.
3) Insert the "sleep" calls.
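To make (1) and (3) concrete, here is a rough, untested sketch of what I
mean. MPI_NPROCS_MAX, MPI_BARRIER_TAG, my_sleep, and BARRIER_POLL are
taken from your code; everything else (including the function name) is
just illustrative:

void my_barrier_sketch(MPI_Comm comm)
{
  int ncomm, comm_id, icomm;
  int sendbuf[MPI_NPROCS_MAX], recvbuf[MPI_NPROCS_MAX];
  MPI_Request request[2 * MPI_NPROCS_MAX];
  int nreq = 0;
  int done = 0;

  MPI_Comm_size(comm, &ncomm);
  MPI_Comm_rank(comm, &comm_id);

  /* post a receive from every other rank first ... */
  for (icomm = 0; icomm < ncomm; icomm++)
  {
    if (icomm == comm_id) continue;
    MPI_Irecv(recvbuf + icomm, 1, MPI_INT, icomm, MPI_BARRIER_TAG,
              comm, request + nreq++);
  }
  /* ... then a send to every other rank */
  for (icomm = 0; icomm < ncomm; icomm++)
  {
    if (icomm == comm_id) continue;
    sendbuf[icomm] = comm_id;
    MPI_Isend(sendbuf + icomm, 1, MPI_INT, icomm, MPI_BARRIER_TAG,
              comm, request + nreq++);
  }

  /* step 1: complete every request, no "sleep" behavior yet:
     MPI_Waitall(nreq, request, MPI_STATUSES_IGNORE); */

  /* step 3: or test the same requests in a loop, sleeping between polls */
  while (!done)
  {
    MPI_Testall(nreq, request, &done, MPI_STATUSES_IGNORE);
    if (!done) my_sleep(0, BARRIER_POLL);
  }
}

The key point is that every MPI_Isend (and MPI_Irecv) request ends up in
the request array and is eventually completed, either by MPI_Waitall or
by the MPI_Testall polling loop.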
If you fix that and still have problems, let us know, along with what
interconnect you're using (possibly on-node shared memory) and which GCC
version.
Gijsbert Wiesenekker wrote:
The following routine fails after some (not reproducible) amount of time on
Fedora Core 12. The routine is a CPU-usage-friendly version of MPI_Barrier.
The verbose output shows that when the problem occurs, one of the (not
reproducible) nodes does not receive a message from one of the other (not
reproducible) nodes, so it looks like the message is lost or never received.
This routine worked fine on Fedora Core 10 with OpenMPI 1.3.x and works fine on
CentOS 5.3 with OpenMPI 1.3.x. The problem occurs with both OpenMPI 1.3.x and
OpenMPI 1.4, and with both gcc and icc.
My question is: is there a problem with this routine that I have overlooked
and that somehow did not show up until now, and if not, how can I debug what
is causing the problem? Is there a way to see which messages have been sent,
have been received, or are still pending?
Regards,
Gijsbert
local void my_barrier(char * info, MPI_Comm comm, int verbose)
{
  int ncomm;
  int comm_id;
  int send[MPI_NPROCS_MAX];
  MPI_Request request[MPI_NPROCS_MAX];
  int icomm;
  int done[MPI_NPROCS_MAX];
  time_t t0, t1;
  double wall[MPI_NPROCS_MAX];
  double wall_max;

  BUG(mpi_nprocs == 1)
  MPI_Comm_size(comm, &ncomm);
  BUG(ncomm < 1)
  MPI_Comm_rank(comm, &comm_id);
  my_printf("entering barrier %s %d %d\n", info, ncomm, comm_id);
  for (icomm = 0; icomm < ncomm; icomm++) send[icomm] = comm_id;
  for (icomm = 0; icomm < ncomm; icomm++)
  {
    if (icomm != comm_id)
    {
      if (verbose) my_printf("sending from %d to %d\n", comm_id, icomm);
      MPI_Isend(send + icomm, 1, MPI_INT, icomm, MPI_BARRIER_TAG,
                comm, request + icomm);
      done[icomm] = FALSE;
    }
    else
    {
      done[icomm] = TRUE;
    }
    wall[icomm] = 0.0;
  }
  t0 = time(NULL);
  while (TRUE)
  {
    int receive;
    int flag;
    MPI_Status status;

    MPI_Iprobe(MPI_ANY_SOURCE, MPI_BARRIER_TAG, comm, &flag, &status);
    if (!flag)
    {
      my_sleep(0, BARRIER_POLL);
      continue;
    }
    BUG(status.MPI_SOURCE < 0)
    BUG(status.MPI_SOURCE >= ncomm)
    MPI_Recv(&receive, 1, MPI_INT, status.MPI_SOURCE, MPI_BARRIER_TAG,
             comm, &status);
    BUG(receive != status.MPI_SOURCE)
    BUG(done[status.MPI_SOURCE])
    if (verbose) my_printf("receiving from %d\n", status.MPI_SOURCE);
    t1 = time(NULL);
    done[status.MPI_SOURCE] = TRUE;
    wall[status.MPI_SOURCE] = difftime(t1, t0);
    for (icomm = 0; icomm < ncomm; icomm++)
      if (!done[icomm]) break;
    if (icomm == ncomm) break;
  }
  my_printf("leaving barrier %s\n", info);
  wall_max = 0;
  for (icomm = 0; icomm < ncomm; icomm++)
  {
    if (verbose)
      my_printf("icomm=%d time=%.0f%s\n",
                icomm, wall[icomm], icomm == comm_id ? " *" : "");
    if (wall[icomm] > wall_max) wall_max = wall[icomm];
  }
  //to be sure
  MPI_Barrier(comm);
  MPI_Allreduce(MPI_IN_PLACE, &wall_max, 1, MPI_DOUBLE, MPI_MAX, comm);
  my_printf("mpi wall_max=%.0f\n", wall_max);
}