I tested my application with the snapshot and it works fine!
thanks.
márcia.

On Thu, Dec 17, 2009 at 6:48 PM, Ralph Castain <r...@open-mpi.org> wrote:
> Will be in the 1.4 nightly tarball generated later tonight...
>
> Thanks again
> Ralph
>
> On Dec 17, 2009, at 4:07 AM, Marcia Cristina Cera wrote:
>
> very good news!!!!
> I will wait carefully for the release :)
>
> Thanks, Ralph
> márcia.
>
> On Wed, Dec 16, 2009 at 10:56 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Ah crumb - I found the problem. Sigh.
>>
>> I actually fixed this in the trunk over 5 months ago when the problem
>> first surfaced in my own testing, but it never came across to the stable
>> release branch. The problem is that we weren't serializing the comm_spawn
>> requests, and so the launch system gets confused over what has and hasn't
>> completed launch. That's why it works fine on the trunk.
>>
>> I'm creating the 1.4 patch right now. Thanks for catching this. Old brain
>> completely forgot until I started tracking the commit history and found my
>> own footprints!
>>
>> Ralph
>>
>> On Dec 16, 2009, at 5:43 AM, Marcia Cristina Cera wrote:
>>
>> Hi Ralph,
>>
>> I am afraid I have been a little hasty!
>> I redid my tests with more care and I got the same error with
>> 1.3.3 as well :-/
>> but in that version the error happens only after some successful
>> executions... which is why I did not notice it before!
>> Furthermore, I increased the number of levels of the tree (which means
>> more concurrent dynamic process creations in the lower levels) and it
>> never runs without the error, unless I add the delay.
>> Perhaps the problem might even be a race condition :(
>>
>> I tested with LAM/MPI 7.1.4 and at first it works fine. I have worked
>> with LAM for years, but I migrated to Open MPI last year since LAM will be
>> discontinued...
>>
>> I think that I can continue the development of my application adding the
>> delay, while I wait for a release...
>> and I leave the performance tests to be made in the future :)
>>
>> Thank you again Ralph,
>> márcia.
>>
>>
>> On Wed, Dec 16, 2009 at 2:17 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> Okay, I can replicate this.
>>>
>>> FWIW: your test program works fine with the OMPI trunk and 1.3.3. It
>>> only has a problem with 1.4. Since I can replicate it on multiple machines
>>> every single time, I don't think it is actually a race condition.
>>>
>>> I think someone made a change to the 1.4 branch that created a failure
>>> mode :-/
>>>
>>> Will have to get back to you on this - may take awhile, and won't be in
>>> the 1.4.1 release.
>>>
>>> Thanks for the replicator!
>>>
>>> On Dec 15, 2009, at 8:35 AM, Marcia Cristina Cera wrote:
>>>
>>> Thank you, Ralph
>>>
>>> I will use the 1.3.3 for now...
>>> while waiting for a future fix release that breaks this race condition.
>>>
>>> márcia
>>>
>>> On Tue, Dec 15, 2009 at 12:58 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>> Looks to me like it is a race condition, and the timing between 1.3.3
>>>> and 1.4 is just enough to trip it. I can break the race, but it will have
>>>> to be in a future fix release.
>>>>
>>>> Meantime, your best bet is to either stick with 1.3.3 or add the delay.
>>>>
>>>> On Dec 15, 2009, at 5:51 AM, Marcia Cristina Cera wrote:
>>>>
>>>> Hi,
>>>>
>>>> I intend to develop an application using MPI_Comm_spawn to dynamically
>>>> create new MPI tasks (or processes).
>>>> The structure of the program is like a tree: each node creates 2 new
>>>> ones until it reaches a predefined number of levels.
>>>>
>>>> I developed a small program to explain my problem, as can be seen in
>>>> the attachment.
>>>> -- start.c: launches (through MPI_Comm_spawn, in which the argv has the
>>>> level value) the root of the tree (a ch_rec program). After the spawn, a
>>>> message is sent to the child and the process blocks in an MPI_Recv.
>>>> -- ch_rec.c: gets its level value and receives the parent message, then
>>>> if its level is less than a predefined limit, it creates 2 children:
>>>> - set the level value;
>>>> - spawn 1 child;
>>>> - send a message;
>>>> - call an MPI_Irecv;
>>>> - repeat the 4 previous steps for the second child;
>>>> - call an MPI_Waitany waiting for the children to return.
>>>> When the children's messages are received, the process sends a message
>>>> to its parent and calls MPI_Finalize.
>>>>
>>>> Using the openmpi-1.3.3 version the program runs as expected but with
>>>> openmpi-1.4 I get the following error:
>>>>
>>>> $ mpirun -np 1 start
>>>> level 0
>>>> level = 1
>>>> Parent sent: level 0 (pid:4279)
>>>> level = 2
>>>> Parent sent: level 1 (pid:4281)
>>>> [xiru-8.portoalegre.grenoble.grid5000.fr:04278] [[42824,0],0]
>>>> ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line
>>>> 758
>>>>
>>>> The error happens when my program tries to launch the second child
>>>> immediately after the first spawn call.
>>>> In my tests I tried putting a sleep of 2 seconds between the first and
>>>> the second spawn, and then the program runs as expected.
>>>>
>>>> Can someone help me with this version 1.4 bug?
>>>>
>>>> thanks,
>>>> márcia.
>>>>
>>>> <spawn-problem.tar.gz>
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
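
The original report notes that adding levels to the tree means more concurrent dynamic process creations in the lower levels. As a rough illustration of how quickly that grows, here is a minimal sketch in C. It does not use MPI and is not taken from the attached programs. The helper names `nodes_at_level` and `total_spawn_calls` are hypothetical, and the numbering is an assumption: the root ch_rec process is taken to be level 0, every node whose level is below `max_level` spawns exactly 2 children, and start contributes one extra spawn for the root.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the attachment): number of ch_rec
 * processes at a given level, ASSUMING the root is level 0 and every
 * node above the leaf level spawns exactly 2 children. */
uint64_t nodes_at_level(unsigned level)
{
    assert(level < 63);            /* keep 2 * 2^level within 64 bits */
    return (uint64_t)1 << level;   /* 2^level */
}

/* Hypothetical helper: total number of MPI_Comm_spawn calls made by the
 * whole run under the same assumptions: 1 call by start for the root,
 * plus 2 calls from every node whose level is below max_level. */
uint64_t total_spawn_calls(unsigned max_level)
{
    assert(max_level <= 63);
    uint64_t calls = 1;            /* start spawning the root */
    for (unsigned level = 0; level < max_level; ++level)
        calls += 2 * nodes_at_level(level);
    return calls;
}
```

Under these assumptions, `total_spawn_calls(3)` is 15, and 8 of those calls come from the four nodes at level 2, which can all be spawning at about the same time. That would be consistent with the report that deeper trees never ran without the error unless the delay was added, although the exact level numbering in the real programs may differ.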