Hi,

I intend to develop an application using the MPI_Comm_spawn to create
dynamically new MPI tasks (or processes).
The structure of the program is like a tree: each node creates 2 new ones
until reaches a predefined number of levels.

I developed a small program to explain my problem as can be seen in
attachment.
-- start.c: launches (through MPI_Comm_spawn, in which the argv has the
level value) the root of the tree (a ch_rec program). Afterward spawn, a
message is sent to  child and the process block in an MPI_Recv.
-- ch_rec.c: gets its level value and receives the parent message, then if
its level is less than a predefined limit, it will creates 2 children:
        - set the level value;
        - spawn 1 child;
        - send a message;
        - call an MPI_Irecv;
        - repeat the 4 previous steps for the second child;
        - call an MPI_Waitany waiting for children returns.
When children messages are received, the process send a message to its
parent and call MPI_Finalize.

Using the openmpi-1.3.3 version the program runs as expected but with
openmpi-1.4 I get the following error:

$ mpirun -np 1 start
level 0
level = 1
Parent sent: level 0 (pid:4279)
level = 2
Parent sent: level 1 (pid:4281)
[xiru-8.portoalegre.grenoble.grid5000.fr:04278] [[42824,0],0]
ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 758

The error happens when my program try to launch the second child immediately
after the first spawn call.
In my tests I try to put an sleep of 2 second between the first and the
second spawn, and then the program runs as expected.

Some one can help me with this version 1.4 bug?

thanks,
márcia.

Attachment: spawn-problem.tar.gz
Description: GNU Zip compressed data

Reply via email to