In trying to run a simple "hello world" type program to test my MPI setup,
I've come across an interesting problem I can't seem to work out. But first,
a bit about my setup:

I have 3 dual-core Athlon machines, all running Ubuntu 8.04, and they've
been set up with Open MPI 1.2.6. The program I'm trying to run is the
following simple test:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define RING_TAG  0xdead
#define RING_ROOT 0

int main (int argc, char *argv[]) {
    int size   = 0;
    int rank   = 0;
    int next   = 0;
    int prev   = 0;
    int value  = 0;
    int result = 0;
    int gresult = 0;
    MPI_Status status;
    MPI_Request request;

    char *host = getenv("HOSTNAME");
    if (host == NULL) {
        host = "unknown";   /* HOSTNAME may be unset; don't pass NULL to %s */
    }
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    //sleep(20);

    if ( 1 < size ) {
        /* compute the neighbours */
        next = (rank+1) % size;
        prev = (size + (rank-1)) % size;

        /* post recv */
        MPI_Irecv(&value, 1, MPI_INT, prev, RING_TAG, MPI_COMM_WORLD,
                  &request);

        /* send data */
        MPI_Send(&rank, 1, MPI_INT, next, RING_TAG, MPI_COMM_WORLD);

        /* wait for data */
        MPI_Wait(&request, &status);

        /* validate data */
        if ( value != prev ) {
            result = 1;
        } else {
            result = 0;
        }

        /* gather results */
        printf("%s - %d) Before\n", host, rank); fflush(stdout);

        MPI_Reduce(&result, &gresult, 1, MPI_INT, MPI_SUM, RING_ROOT,
                   MPI_COMM_WORLD);
        /* MPI_Allreduce(&result, &gresult, 1, MPI_INT, MPI_SUM,
                         MPI_COMM_WORLD); */

        printf("%s - %d) After\n", host, rank); fflush(stdout);

        if ( rank == RING_ROOT ) {
            if ( 0 == gresult ) {
                printf("PASSED : %i errors.\n", gresult);
            } else {
                printf("FAILED : %i errors.\n", gresult);
            }
        }


    } else {
        printf ("You have to use more than 1 process.\n");
    }

    MPI_Finalize();

    return 0;
}
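
As a sanity check that does not involve MPI at all, the neighbour
arithmetic used above can be exercised on its own. This is only a minimal
sketch: ring_next, ring_prev and ring_count_errors are names made up for
illustration and are not part of the test program. It simulates the
exchange serially, so it says nothing about the hang itself.

```c
#include <assert.h>

/* Rank that receives our message: wraps from size-1 back to 0. */
int ring_next(int rank, int size) {
    return (rank + 1) % size;
}

/* Rank whose message we receive: adding size keeps the operand
   non-negative when rank is 0. */
int ring_prev(int rank, int size) {
    return (size + (rank - 1)) % size;
}

/* Serial stand-in for the exchange: every rank "sends" its own rank to
   ring_next, so each receiver should see the value ring_prev.
   Returns the number of mismatches (hypothetical helper). */
int ring_count_errors(int size) {
    int errors = 0;
    int sender;

    assert(size > 1);   /* the MPI test also requires more than 1 process */
    for (sender = 0; sender < size; sender++) {
        int receiver = ring_next(sender, size);
        if (ring_prev(receiver, size) != sender) {
            errors++;
        }
    }
    return errors;
}
```

With 6 processes (3 dual-core nodes), ring_count_errors(6) should report no
mismatches, which suggests the validation logic is not what goes wrong.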

This program runs just fine under the following conditions:

1) If I run on a single node
2) If I run on multiple nodes but replace MPI_Reduce with another
collective (MPI_Bcast, MPI_Allreduce, etc.)

But it hangs if I run on multiple nodes and keep the MPI_Reduce as it is.

The problem is especially frustrating because every other MPI function
works without a problem, yet the Reduce operation alone causes the entire
run to hang. The symptoms during a hang are as follows:

1) Output stops partway: some of the printf() statements come through, but
some of the ranks (on any of the nodes) never reach the "After" line.
2) Running 'top' on the other nodes shows that the correct number of
processes are running, each at 100% CPU.
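
To make it easier to tell which node each "Before"/"After" line comes
from, the host name could be read with POSIX gethostname() rather than the
HOSTNAME environment variable, which may not be set for processes started
on remote nodes. A minimal sketch, assuming a POSIX system; node_name is a
hypothetical helper, not something in the program above:

```c
#define _POSIX_C_SOURCE 200112L
#include <string.h>
#include <unistd.h>

/* Fill buf with this machine's host name and return buf.
   Falls back to "unknown" if gethostname() fails. */
char *node_name(char *buf, size_t len) {
    if (len == 0) {
        return buf;
    }
    if (gethostname(buf, len) != 0) {
        strncpy(buf, "unknown", len);
    }
    buf[len - 1] = '\0';   /* a truncated name may not be terminated */
    return buf;
}
```

In the test program this would be called once into a local char array
(for example 256 bytes) and the result passed to the two printf() calls in
place of host. MPI_Get_processor_name() is the MPI-level alternative.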

I'm confident that this isn't an issue with path settings or environment
variables since, as mentioned above, the program executes and finishes just
fine when anything other than MPI_Reduce is used.

Has anyone encountered a problem like this?

Thank you,
~Eric
