Ah crumb - I found the problem. Sigh.

I actually fixed this in the trunk over 5 months ago when the problem first 
surfaced in my own testing, but it never came across to the stable release 
branch. The problem is that we weren't serializing the comm_spawn requests, and 
so the launch system gets confused over what has and hasn't completed launch. 
That's why it works fine on the trunk.
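
For illustration only, here is a minimal sketch in C of what serializing the requests means. This is not the actual patch and not Open MPI code: spawn_queue, sq_init, sq_submit, sq_complete and the SQ_* constants are hypothetical names made up for this example, and it assumes job ids are non-negative integers. The idea is simply that a new spawn request does not start launching until the previous one has reported completion:

```c
#include <assert.h>
#include <stddef.h>

#define SQ_CAPACITY 16   /* hypothetical limit on waiting requests */
#define SQ_NONE     (-1) /* no job was started */
#define SQ_FULL     (-2) /* request rejected: queue is full */

/* Hypothetical bookkeeping for spawn requests; not an Open MPI type. */
typedef struct {
    int    waiting[SQ_CAPACITY]; /* ring buffer of job ids not yet launched */
    size_t head;                 /* index of the oldest waiting job */
    size_t count;                /* number of waiting jobs */
    int    active;               /* job currently launching, or SQ_NONE */
} spawn_queue;

void sq_init(spawn_queue *q)
{
    q->head = 0;
    q->count = 0;
    q->active = SQ_NONE;
}

/* Submit a spawn request. Returns the job id if it starts launching
 * immediately, SQ_NONE if it had to wait, or SQ_FULL if it was rejected. */
int sq_submit(spawn_queue *q, int job)
{
    if (q->active == SQ_NONE) {
        q->active = job;
        return job;
    }
    if (q->count == SQ_CAPACITY) {
        return SQ_FULL;
    }
    q->waiting[(q->head + q->count) % SQ_CAPACITY] = job;
    q->count++;
    return SQ_NONE;
}

/* Report that the active job finished launching. Returns the id of the
 * next job started, or SQ_NONE if nothing was waiting. */
int sq_complete(spawn_queue *q)
{
    q->active = SQ_NONE;
    if (q->count > 0) {
        q->active = q->waiting[q->head];
        q->head = (q->head + 1) % SQ_CAPACITY;
        q->count--;
    }
    return q->active;
}
```

With two back-to-back requests, the second stays queued until sq_complete is called for the first, so only one launch is ever in flight and there is no ambiguity about which job has completed launch.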

I'm creating the 1.4 patch right now. Thanks for catching this. Old brain 
completely forgot until I started tracking the commit history and found my own 
footprints!

Ralph

On Dec 16, 2009, at 5:43 AM, Marcia Cristina Cera wrote:

> Hi Ralph,
> 
> I am afraid I was a little hasty!
> I redid my tests more carefully and got the same error with 1.3.3 as well 
> :-/
> but in that version the error only appears after some successful executions, 
> which is why I did not notice it before!
> Furthermore, I increased the number of levels of the tree (which means more 
> concurrent dynamic process creations in the lower levels) and it never 
> runs without error unless I add the delay. 
> Perhaps the problem really is a race condition :(
> 
> I tested with LAM/MPI 7.1.4 and at first glance it works fine. I worked 
> with LAM for years, but I migrated to Open MPI last year since LAM will be 
> discontinued... 
> 
> I think I can continue developing my application with the delay added 
> while I wait for a release... and I will leave the performance tests for 
> the future :)
> 
> Thank you again Ralph,
> márcia.
> 
> 
> On Wed, Dec 16, 2009 at 2:17 AM, Ralph Castain <r...@open-mpi.org> wrote:
> Okay, I can replicate this.
> 
> FWIW: your test program works fine with the OMPI trunk and 1.3.3. It only 
> has a problem with 1.4. Since I can replicate it on multiple machines every 
> single time, I don't think it is actually a race condition.
> 
> I think someone made a change to the 1.4 branch that created a failure mode 
> :-/
> 
> Will have to get back to you on this - may take a while, and won't be in the 
> 1.4.1 release.
> 
> Thanks for the replicator!
> 
> On Dec 15, 2009, at 8:35 AM, Marcia Cristina Cera wrote:
> 
>> Thank you, Ralph
>> 
>> I will use 1.3.3 for now... 
>> while waiting for a future fix release that breaks this race condition.
>> 
>> márcia
>> 
>> On Tue, Dec 15, 2009 at 12:58 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> Looks to me like it is a race condition, and the timing between 1.3.3 and 
>> 1.4 is just enough to trip it. I can break the race, but it will have to be 
>> in a future fix release.
>> 
>> Meantime, your best bet is to either stick with 1.3.3 or add the delay.
>> 
>> On Dec 15, 2009, at 5:51 AM, Marcia Cristina Cera wrote:
>> 
>>> Hi,
>>> 
>>> I intend to develop an application that uses MPI_Comm_spawn to dynamically 
>>> create new MPI tasks (or processes). 
>>> The program is structured like a tree: each node creates 2 new ones 
>>> until a predefined number of levels is reached.
>>> 
>>> I developed a small program to illustrate my problem; it is in the 
>>> attachment.
>>> -- start.c: launches the root of the tree (a ch_rec program) through 
>>> MPI_Comm_spawn, passing the level value in argv. After the spawn, a 
>>> message is sent to the child and the process blocks in an MPI_Recv.
>>> -- ch_rec.c: gets its level value and receives the parent's message; then, 
>>> if its level is less than a predefined limit, it creates 2 children: 
>>>         - set the level value;
>>>         - spawn 1 child;
>>>         - send a message;
>>>         - call an MPI_Irecv;
>>>         - repeat the 4 previous steps for the second child;
>>>         - call MPI_Waitany to wait for the children to return.
>>> When the children's messages are received, the process sends a message to 
>>> its parent and calls MPI_Finalize.
>>> 
>>> With openmpi-1.3.3 the program runs as expected, but with 
>>> openmpi-1.4 I get the following error:
>>> 
>>> $ mpirun -np 1 start
>>> level 0
>>> level = 1
>>> Parent sent: level 0 (pid:4279)
>>> level = 2
>>> Parent sent: level 1 (pid:4281)
>>> [xiru-8.portoalegre.grenoble.grid5000.fr:04278] [[42824,0],0] 
>>> ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 758
>>> 
>>> The error happens when my program tries to launch the second child 
>>> immediately after the first spawn call. 
>>> In my tests I put a sleep of 2 seconds between the first and the 
>>> second spawn, and then the program runs as expected.
>>> 
>>> Can someone help me with this version 1.4 bug? 
>>> 
>>> thanks,
>>> márcia.
>>> 
>>> <spawn-problem.tar.gz>_______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
