Okay, I can replicate this.

FWIW: your test program works fine with the OMPI trunk and 1.3.3. It only has 
a problem with 1.4. Since I can replicate it on multiple machines every single 
time, I don't think it is actually a race condition.

I think someone made a change to the 1.4 branch that created a failure mode :-/

Will have to get back to you on this - it may take a while, and the fix won't 
be in the 1.4.1 release.

Thanks for the replicator!

On Dec 15, 2009, at 8:35 AM, Marcia Cristina Cera wrote:

> Thank you, Ralph
> 
> I will use 1.3.3 for now... 
> while waiting for a future fix release that breaks this race condition.
> 
> márcia
> 
> On Tue, Dec 15, 2009 at 12:58 PM, Ralph Castain <r...@open-mpi.org> wrote:
> Looks to me like it is a race condition, and the timing between 1.3.3 and 1.4 
> is just enough to trip it. I can break the race, but it will have to be in a 
> future fix release.
> 
> Meantime, your best bet is to either stick with 1.3.3 or add the delay.
> 
> On Dec 15, 2009, at 5:51 AM, Marcia Cristina Cera wrote:
> 
>> Hi,
>> 
>> I intend to develop an application that uses MPI_Comm_spawn to dynamically 
>> create new MPI tasks (or processes). 
>> The structure of the program is like a tree: each node creates 2 new ones 
>> until it reaches a predefined number of levels.
>> 
>> I wrote a small program to illustrate my problem; see the attachment.
>> -- start.c: launches the root of the tree (a ch_rec program) through 
>> MPI_Comm_spawn, passing the level value in argv. After the spawn, it sends a 
>> message to the child and blocks in an MPI_Recv.
>> -- ch_rec.c: gets its level value and receives the parent's message; then, if 
>> its level is less than a predefined limit, it creates 2 children: 
>>         - set the level value;
>>         - spawn 1 child;
>>         - send a message;
>>         - call an MPI_Irecv;
>>         - repeat the 4 previous steps for the second child;
>>         - call an MPI_Waitany to wait for the children's replies.
>> When the children's messages are received, the process sends a message to its 
>> parent and calls MPI_Finalize.
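
For anyone sizing a test run, here is a minimal sketch (plain C, no MPI 
calls) of how many ch_rec processes the tree described above contains, and 
how many times a process issues two spawns back to back, which is the 
pattern that trips the error. The helper names tree_processes and 
back_to_back_spawns are hypothetical, and the sketch assumes the root is at 
level 0 and that every process whose level is below the limit spawns exactly 
two children; the attached program may count levels differently.

```c
/* Hypothetical helper: number of ch_rec processes in the whole tree.
 * Assumption: the root is at level 0, and every process whose level is
 * below `limit` spawns exactly two children, so the leaves are at
 * level == limit. A negative limit yields 0. */
long tree_processes(int limit)
{
    long total = 0;
    long at_level = 1;               /* one root at level 0 */

    for (int level = 0; level <= limit; level++) {
        total += at_level;           /* processes living at this level */
        at_level *= 2;               /* each of them spawns two children */
    }
    return total;
}

/* Hypothetical helper: number of processes that spawn two children one
 * right after the other. These are the processes with level < limit,
 * i.e. all processes of a tree that is one level shallower. */
long back_to_back_spawns(int limit)
{
    return tree_processes(limit - 1);
}
```

Under those assumptions, a limit of 2 gives 7 ch_rec processes, 3 of which 
perform the two consecutive spawns; a limit of 3 gives 15 and 7.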
>> 
>> Using the openmpi-1.3.3 version the program runs as expected but with 
>> openmpi-1.4 I get the following error:
>> 
>> $ mpirun -np 1 start
>> level 0
>> level = 1
>> Parent sent: level 0 (pid:4279)
>> level = 2
>> Parent sent: level 1 (pid:4281)
>> [xiru-8.portoalegre.grenoble.grid5000.fr:04278] [[42824,0],0] 
>> ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 758
>> 
>> The error happens when my program tries to launch the second child 
>> immediately after the first spawn call. 
>> In my tests, I put a sleep of 2 seconds between the first and the 
>> second spawn, and then the program runs as expected.
>> 
>> Can someone help me with this version 1.4 bug? 
>> 
>> thanks,
>> márcia.
>> 
>> <spawn-problem.tar.gz>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
