I am trying to debug a really weird issue. I have a simple MPI hello world application (attached) that hangs when I run it on our cluster using 256 nodes with 16 cores per node. The cluster uses QDR IB.
I am able to run the test over ethernet by excluding openib from the btl. What is weird, however, is that on the same set of nodes xhpl completes without any error using 256 nodes and 16 cores. I have also tried the Pallas MPI Benchmark, and it behaves like hello world: it hangs when run on 256 nodes. When I attach gdb to the MPI processes and look at the backtraces, I see that close to 1,000 of the MPI processes are stuck in MPI_Send while the others are waiting in MPI_Finalize.

I have checked that the ulimit setting for locked memory is unlimited, and the number of open files per process is 131072. The default MPI stack on the system is openmpi-1.6.1; I also compiled openmpi-1.6.3 in my home directory and the behavior is the same. I would appreciate any help in debugging this issue. -- Bharath
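For reference, the workaround and diagnostic steps described above look roughly like this (a sketch; the mpirun flags are the Open MPI 1.6-era spellings, and the binary name, pid, and process counts are illustrative):

```shell
# Workaround: exclude the openib BTL so traffic falls back to TCP/ethernet
mpirun --mca btl ^openib -np 4096 --npernode 16 ./hello

# Limits checked on the compute nodes
ulimit -l   # locked memory: unlimited
ulimit -n   # open files per process: 131072

# Backtrace of a hung rank (replace <pid> with the MPI process id)
gdb -p <pid> -ex bt -ex detach -ex quit
```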
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[])
{
    char msg[256], name[MPI_MAX_PROCESSOR_NAME];
    int rank, size, len, src;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Get_processor_name(name, &len);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sprintf(msg, "Hello world from process %d of %d running on %s\n",
            rank, size, name);
    if (rank == 0) {
        /* Rank 0 prints its own message, then receives and prints
           one message from every other rank. */
        printf("%s", msg);
        for (src = 1; src < size; ++src) {
            MPI_Recv(msg, sizeof(msg), MPI_BYTE, src, 1,
                     MPI_COMM_WORLD, &status);
            printf("%s", msg);
        }
    } else {
        MPI_Send(msg, strlen(msg) + 1, MPI_BYTE, 0, 1, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return EXIT_SUCCESS;
}