Hi,

I'm working with MPI_Comm_spawn and I am getting some error messages.

The code is relatively simple:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <mpi.h>
#include <unistd.h>   /* gethostname */

int main(int argc, char ** argv){

        int i;
        int rank, size, child_rank;
        char nomehost[20];
        MPI_Comm parent, intercomm1, intercomm2;
        int erro;
        int level, curr_level;


        MPI_Init(&argc, &argv);
        level = atoi(argv[1]);

        MPI_Comm_get_parent(&parent);

        if(parent == MPI_COMM_NULL){
                rank=0;
        }
        else{
                MPI_Recv(&rank, 1, MPI_INT, 0, 0, parent, MPI_STATUS_IGNORE);
        }

        curr_level = (int) log2(rank+1);

        printf(" --> rank: %d and curr_level: %d\n", rank, curr_level);

        // Node propagation
        if(curr_level < level){

                // first child: 2^(curr_level+1) - 1 + 2*(rank - (2^curr_level - 1))
                //            = 2*rank + 1
                child_rank = 2*rank + 1;
                printf("(%d) Before create rank %d\n", rank, child_rank);
                MPI_Comm_spawn(argv[0], &argv[1], 1, MPI_INFO_NULL, 0,
                               MPI_COMM_SELF, &intercomm1, &erro);
                printf("(%d) After create rank %d\n", rank, child_rank);

                MPI_Send(&child_rank, 1, MPI_INT, 0, 0, intercomm1);

                //sleep(1);

                child_rank = child_rank + 1;
                printf("(%d) Before create rank %d\n", rank, child_rank);
                MPI_Comm_spawn(argv[0], &argv[1], 1, MPI_INFO_NULL, 0,
                               MPI_COMM_SELF, &intercomm2, &erro);
                printf("(%d) After create rank %d\n", rank, child_rank);

                MPI_Send(&child_rank, 1, MPI_INT, 0, 0, intercomm2);

        }

        gethostname(nomehost, 20);
        printf("(%d) in %s\n", rank, nomehost);

        MPI_Finalize();
        return(0);

}

The program creates a binary tree of processes, spawning children until
it reaches the depth given by the variable "level". If the level is 2,
the tree will be:
           (0)
         /     \
      (1)       (2)
     /   \     /   \
   (3)   (4) (5)   (6)

The error messages (when I use 1 host) are:

Compiling: mpicc test.c -o test -lm
Running: mpirun -np 1 ./test 3

 --> rank: 0 and curr_level: 0
(0) Before create rank 1
(0) After create rank 1
(0) Before create rank 2
 --> rank: 1 and curr_level: 1
(1) Before create rank 3
[cacau.ic.uff.br:17892] [[31928,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 75

When I use 2 hosts, the error is worse. The code is similar to the one
shown here; the only difference is that I set the hosts with
MPI_Info_set before each spawn.
With LAM/MPI, the program runs normally.

I think something goes wrong when I call MPI_Comm_spawn twice in a row
and the child processes spawn further processes as well.
It seems to be a race condition, because the error does not always
happen (when the level is 2, for example). With 3 levels or more, the
error occurs repeatedly.

A similar error was previously reported in another thread:
http://www.open-mpi.org/community/lists/users/2009/12/11601.php
However, I am using the stable version 1.4.4 and the problem still happens.
Do the developers plan to fix it?

Thanks,
Fernanda
