I am trying to debug a really weird issue. I have a
simple MPI hello world application (attached) that hangs when I
try to run it on our cluster using 256 nodes with 16 cores on each
node. The cluster uses QDR IB.

I am able to run the test over Ethernet by excluding openib from
the BTL list. What is weird, however, is that on the same set of
nodes xhpl completes without any error using 256 nodes and 16 cores
per node. I have also tried running the Pallas MPI Benchmark; it
behaves like hello world and ends up hanging when I run it on
256 nodes.

When I attach gdb to the MPI processes and look at the backtrace,
I see that close to 1000 of the MPI processes are stuck in MPI_Send
while the others are waiting in MPI_Finalize. I have checked to
make sure that the ulimit setting for locked memory is unlimited.
The limit on open files per process is 131072. The default MPI
stack provided on the system is openmpi-1.6.1. I compiled
openmpi-1.6.3 in my home directory and the behavior remains
the same.

I would appreciate any help in debugging this issue.

-- 
Bharath
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[])
{
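	char msg[256], name[MPI_MAX_PROCESSOR_NAME];
	int rank, size, src, namelen;
	MPI_Status status;

	MPI_Init(&argc, &argv);
	/* namelen receives the length of the processor name */
	MPI_Get_processor_name(name, &namelen);
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);
	MPI_Comm_size(MPI_COMM_WORLD, &size);
	/* snprintf bounds the write to msg and guarantees NUL termination */
	snprintf(msg, sizeof(msg), "Hello world from process %d of %d running on %s\n", rank, size, name);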
	if (rank == 0) {
		printf("%s", msg);
		for (src = 1; src < size; ++src) {
			MPI_Recv(msg, sizeof(msg), MPI_BYTE, src, 1, MPI_COMM_WORLD, &status);
			printf("%s", msg);
		}
	} else {
		MPI_Send(msg, strlen(msg) + 1, MPI_BYTE, 0, 1, MPI_COMM_WORLD);
	}

	MPI_Finalize();

	return EXIT_SUCCESS;
}
