Will be in the 1.4 nightly tarball generated later tonight...

Thanks again
Ralph

On Dec 17, 2009, at 4:07 AM, Marcia Cristina Cera wrote:

> Very good news!
> I will be waiting eagerly for the release :)
> 
> Thanks, Ralph
> márcia.
> 
> On Wed, Dec 16, 2009 at 10:56 PM, Ralph Castain <r...@open-mpi.org> wrote:
> Ah crumb - I found the problem. Sigh.
> 
> I actually fixed this in the trunk over 5 months ago when the problem first 
> surfaced in my own testing, but it never came across to the stable release 
> branch. The problem is that we weren't serializing the comm_spawn requests, 
> and so the launch system gets confused over what has and hasn't completed 
> launch. That's why it works fine on the trunk.
> 
> I'm creating the 1.4 patch right now. Thanks for catching this. Old brain 
> completely forgot until I started tracking the commit history and found my 
> own footprints!
> 
> Ralph
> 
> On Dec 16, 2009, at 5:43 AM, Marcia Cristina Cera wrote:
> 
>> Hi Ralph,
>> 
>> I am afraid I was a little hasty!
>> I redid my tests more carefully and got the same error with 1.3.3 as well :-/
>> In that version, though, the error only appears after some successful 
>> executions... which is why I did not notice it before!
>> Furthermore, I increased the number of levels in the tree (which means more 
>> concurrent dynamic process creations at the lower levels), and I never manage 
>> to run without the error unless I add the delay. 
>> Perhaps the problem really is a race condition after all :(
>> 
>> I tested with LAM/MPI 7.1.4 and at first glance it works fine. I had worked 
>> with LAM for years, but I migrated to Open MPI last year since LAM is being 
>> discontinued... 
>> 
>> I think I can continue developing my application with the delay added while 
>> I wait for a release... and leave the performance tests to be 
>> done in the future :)
>> 
>> Thank you again Ralph,
>> márcia.
>> 
>> 
>> On Wed, Dec 16, 2009 at 2:17 AM, Ralph Castain <r...@open-mpi.org> wrote:
>> Okay, I can replicate this.
>> 
>> FWIW: your test program works fine with the OMPI trunk and 1.3.3. It only 
>> has a problem with 1.4. Since I can replicate it on multiple machines every 
>> single time, I don't think it is actually a race condition.
>> 
>> I think someone made a change to the 1.4 branch that created a failure mode 
>> :-/
>> 
>> Will have to get back to you on this - may take a while, and won't be in the 
>> 1.4.1 release.
>> 
>> Thanks for the replicator!
>> 
>> On Dec 15, 2009, at 8:35 AM, Marcia Cristina Cera wrote:
>> 
>>> Thank you, Ralph
>>> 
>>> I will use 1.3.3 for now... 
>>> while waiting for a future bug-fix release that breaks this race condition.
>>> 
>>> márcia
>>> 
>>> On Tue, Dec 15, 2009 at 12:58 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>> Looks to me like it is a race condition, and the timing between 1.3.3 and 
>>> 1.4 is just enough to trip it. I can break the race, but it will have to be 
>>> in a future fix release.
>>> 
>>> Meantime, your best bet is to either stick with 1.3.3 or add the delay.
>>> 
>>> On Dec 15, 2009, at 5:51 AM, Marcia Cristina Cera wrote:
>>> 
>>>> Hi,
>>>> 
>>>> I intend to develop an application that uses MPI_Comm_spawn to create new 
>>>> MPI tasks (processes) dynamically. 
>>>> The structure of the program is like a tree: each node creates 2 new ones 
>>>> until a predefined number of levels is reached.
>>>> 
>>>> I wrote a small program, attached, to illustrate my problem.
>>>> -- start.c: launches the root of the tree (a ch_rec program) through 
>>>> MPI_Comm_spawn, passing the level value in argv. After the spawn, a 
>>>> message is sent to the child and the process blocks in an MPI_Recv.
>>>> -- ch_rec.c: gets its level value and receives the parent's message; then, 
>>>> if its level is less than a predefined limit, it creates 2 children: 
>>>>         - set the level value;
>>>>         - spawn 1 child;
>>>>         - send it a message;
>>>>         - post an MPI_Irecv;
>>>>         - repeat the 4 previous steps for the second child;
>>>>         - call MPI_Waitany to wait for the children's replies.
>>>> When the children's messages are received, the process sends a message to 
>>>> its own parent and calls MPI_Finalize (a rough sketch of these steps 
>>>> follows below).
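>>>> 
>>>> For clarity, here is a minimal sketch of what ch_rec.c could look like, 
>>>> following the steps above. This is not the attached code: the LIMIT 
>>>> constant, buffer sizes, and message contents are my own assumptions.
>>>> 
>>>> /* ch_rec.c -- hypothetical sketch of the recursive spawner described above */
>>>> #include <mpi.h>
>>>> #include <stdio.h>
>>>> #include <stdlib.h>
>>>> #include <unistd.h>
>>>> 
>>>> #define LIMIT 2                        /* assumed maximum tree depth */
>>>> 
>>>> int main(int argc, char *argv[])
>>>> {
>>>>     MPI_Comm parent, child[2];
>>>>     MPI_Request req[2];
>>>>     char msg[64], buf[2][64], next[16];
>>>>     char *child_argv[] = { next, NULL };
>>>>     int level, i, idx;
>>>> 
>>>>     MPI_Init(&argc, &argv);
>>>>     MPI_Comm_get_parent(&parent);
>>>> 
>>>>     level = atoi(argv[1]);             /* level value passed via argv */
>>>>     MPI_Recv(msg, sizeof(msg), MPI_CHAR, 0, 0, parent, MPI_STATUS_IGNORE);
>>>>     printf("Parent sent: %s\n", msg);
>>>> 
>>>>     if (level < LIMIT) {
>>>>         snprintf(next, sizeof(next), "%d", level + 1);
>>>>         for (i = 0; i < 2; i++) {
>>>>             /* spawn one child, send it a message, post a non-blocking receive */
>>>>             MPI_Comm_spawn("ch_rec", child_argv, 1, MPI_INFO_NULL, 0,
>>>>                            MPI_COMM_SELF, &child[i], MPI_ERRCODES_IGNORE);
>>>>             snprintf(msg, sizeof(msg), "level %d (pid:%d)", level, (int)getpid());
>>>>             MPI_Send(msg, sizeof(msg), MPI_CHAR, 0, 0, child[i]);
>>>>             MPI_Irecv(buf[i], sizeof(buf[i]), MPI_CHAR, 0, 0, child[i], &req[i]);
>>>>         }
>>>>         /* wait for both children to report back */
>>>>         for (i = 0; i < 2; i++)
>>>>             MPI_Waitany(2, req, &idx, MPI_STATUS_IGNORE);
>>>>     }
>>>> 
>>>>     /* report back to our own parent and finish */
>>>>     MPI_Send("done", 5, MPI_CHAR, 0, 0, parent);
>>>>     MPI_Finalize();
>>>>     return 0;
>>>> }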
>>>> 
>>>> Using openmpi-1.3.3 the program runs as expected, but with openmpi-1.4 I 
>>>> get the following error:
>>>> 
>>>> $ mpirun -np 1 start
>>>> level 0
>>>> level = 1
>>>> Parent sent: level 0 (pid:4279)
>>>> level = 2
>>>> Parent sent: level 1 (pid:4281)
>>>> [xiru-8.portoalegre.grenoble.grid5000.fr:04278] [[42824,0],0] 
>>>> ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 
>>>> 758
>>>> 
>>>> The error happens when my program tries to launch the second child 
>>>> immediately after the first spawn call. 
>>>> In my tests I tried adding a sleep of 2 seconds between the first and the 
>>>> second spawn, and then the program runs as expected.
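>>>> 
>>>> In terms of the hypothetical sketch above, that workaround amounts to a 
>>>> delay inside the spawn loop (again, an illustration only, not the attached 
>>>> code):
>>>> 
>>>>         for (i = 0; i < 2; i++) {
>>>>             MPI_Comm_spawn("ch_rec", child_argv, 1, MPI_INFO_NULL, 0,
>>>>                            MPI_COMM_SELF, &child[i], MPI_ERRCODES_IGNORE);
>>>>             MPI_Send(msg, sizeof(msg), MPI_CHAR, 0, 0, child[i]);
>>>>             MPI_Irecv(buf[i], sizeof(buf[i]), MPI_CHAR, 0, 0, child[i], &req[i]);
>>>>             if (i == 0)
>>>>                 sleep(2);   /* workaround: wait before spawning the second child */
>>>>         }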
>>>> 
>>>> Can someone help me with this 1.4 bug? 
>>>> 
>>>> thanks,
>>>> márcia.
>>>> 
>>>> <spawn-problem.tar.gz>
