In trying to run a simple "hello world" type program to test my MPI setup, I've come across an interesting problem I can't seem to work out. But first, a bit about my setup:
I have 3 dual-core Athlon machines, all running Ubuntu 8.04, and they've been set up with Open MPI 1.2.6. The program I'm trying to run is the following simple test:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define RING_TAG  0xdead
    #define RING_ROOT 0

    int main (int argc, char *argv[])
    {
        int size = 0;
        int rank = 0;
        int next = 0;
        int prev = 0;
        int value = 0;
        int result = 0;
        int gresult = 0;
        MPI_Status status;
        MPI_Request request;
        char * host;

        host = getenv ("HOSTNAME");

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        //sleep(20);

        if ( 1 < size ) {
            /* compute the neighbours */
            next = (rank+1) % size;
            prev = (size + (rank-1)) % size;

            /* post recv */
            MPI_Irecv(&value, 1, MPI_INT, prev, RING_TAG, MPI_COMM_WORLD, &request);

            /* send data */
            MPI_Send(&rank, 1, MPI_INT, next, RING_TAG, MPI_COMM_WORLD);

            /* wait for data */
            MPI_Wait(&request, &status);

            /* validate data */
            if ( value != prev ) {
                result = 1;
            } else {
                result = 0;
            }

            /* gather results */
            printf ("%s - %d) Before\n", host, rank);
            fflush(stdout);

            MPI_Reduce(&result, &gresult, 1, MPI_INT, MPI_SUM, RING_ROOT, MPI_COMM_WORLD);
            //MPI_Allreduce(&result, &gresult, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

            printf ("%s - %d) After\n", host, rank);
            fflush(stdout);

            if ( rank == RING_ROOT ) {
                if ( 0 == gresult ) {
                    printf("PASSED : %i errors.\n", gresult);
                } else {
                    printf("FAILED : %i errors.\n", gresult);
                }
            }
        } else {
            printf ("You have to use more than 1 process.\n");
        }

        MPI_Finalize();
        return 0;
    }

This program runs just fine under the following conditions:

1) If I run on a single node.
2) If I run on multiple nodes but change the MPI_Reduce operation to anything else (MPI_Bcast, MPI_Allreduce, etc).

But it hangs if I run on multiple nodes and keep the MPI_Reduce as it is. The problem is especially frustrating because I can see no reason why all the other functions should work without a problem while the MPI_Reduce operation alone causes the entire job to hang.
The symptoms during a hang are as follows:

1) The output stops partway through. Some of the printf() statements come through, but some of the processes, on any of the nodes, never reach the "After" print statement.
2) Checking the other nodes with 'top' shows that the correct number of processes are running, each at 100% CPU.

I'm confident that this isn't an issue with path settings or environment variables since, as I mentioned before, the program executes and finishes just fine when anything other than MPI_Reduce is used.

Has anyone encountered a problem like this?

Thank you,
~Eric