Fixed - sorry about that!
> On Jan 15, 2015, at 10:39 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> Ah, indeed - I found the problem. Fix coming momentarily
>
>> On Jan 15, 2015, at 10:31 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> Hmmm…I’m not seeing a failure. Let me try on another system.
>>
>>
>> Modifying libevent is not a viable solution :-(
>>
>>
>>> On Jan 15, 2015, at 10:26 AM, Leonid <lchis...@pathscale.com> wrote:
>>>
>>> Hi Ralph.
>>>
>>> Of course that may indicate an issue with the custom compiler, but given that
>>> it fails with gcc and an inserted delay I still think it is an OMPI bug, since
>>> such a delay could be caused by the operating system at that exact point.
>>>
>>> For me simply commenting out "base->event_gotterm = base->event_break = 0;"
>>> seems to do the trick, but I am not completely sure that it won't cause any
>>> other trouble.
>>>
>>> I've tried to update my master branch to the latest version (including your
>>> fix) but now it just crashes for me on *all* benchmarks that I am trying
>>> (both with gcc and our compiler).
>>>
>>> On 15.01.2015 18:57, Ralph Castain wrote:
>>>> Thought about this some more and realized that the orte progress engine
>>>> wasn’t using the opal_progress_thread support functions, which include a
>>>> “break” event to kick us out of just such problems. So I changed it on the
>>>> master. From your citing of libevent 2.0.22, I believe that must be where
>>>> you are working, yes?
>>>>
>>>> If so, give the changed version a try and see if your problem is resolved.
>>>>
>>>>
>>>>> On Jan 15, 2015, at 12:55 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>
>>>>> Given that you could only reproduce it with either your custom compiler
>>>>> or by forcibly introducing a delay, is this indicating an issue with the
>>>>> custom compiler? It does seem strange that we don't see this anywhere
>>>>> else, given the number of times that code gets run.
>>>>>
>>>>> The only alternative solution I can think of would be to push the finalize
>>>>> request into the event loop, and thus execute the loopbreak from within
>>>>> an event. You might try and see if that solves the problem.
>>>>>
>>>>>
>>>>>> On Jan 14, 2015, at 11:54 PM, Leonid <lchis...@pathscale.com> wrote:
>>>>>>
>>>>>> Hi all.
>>>>>>
>>>>>> I believe there is a bug in the event_base_loop() function in event.c
>>>>>> (opal/mca/event/libevent2022/libevent/).
>>>>>>
>>>>>> Consider the case where the application is about to be finalized and both
>>>>>> event_base_loop() and event_base_loopbreak() are called at the same time
>>>>>> in parallel threads.
>>>>>>
>>>>>> If event_base_loopbreak() happens to acquire the lock first, it will
>>>>>> set "event_base->event_break = 1", but won't send any signal to the event
>>>>>> loop, because the loop has not started yet.
>>>>>>
>>>>>> After that, event_base_loop() will acquire the lock and clear the
>>>>>> event_break flag with the following statement: "base->event_gotterm =
>>>>>> base->event_break = 0;". Then it will go into polling with timeout = -1
>>>>>> and thus block forever.
>>>>>>
>>>>>> This issue was reproduced with a custom compiler (using the Lulesh benchmark
>>>>>> on an x86 4-core PC), but I can also reproduce it with GCC (on almost any
>>>>>> benchmark and on the same hardware) by adding a delay to the
>>>>>> orte_progress_thread_engine() function:
>>>>>>
>>>>>> static void* orte_progress_thread_engine(opal_object_t *obj)
>>>>>> {
>>>>>>     while (orte_event_base_active) {
>>>>>>         /* add sleep to allow orte_ess_base_app_finalize() to set
>>>>>>            the orte_event_base_active flag to false */
>>>>>>         usleep(1000);
>>>>>>         opal_event_loop(orte_event_base, OPAL_EVLOOP_ONCE);
>>>>>>     }
>>>>>>     return OPAL_THREAD_CANCELLED;
>>>>>> }
>>>>>>
>>>>>> I am not completely sure what the best fix for the described problem
>>>>>> would be.
>>>>>>
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> us...@open-mpi.org
>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>> Link to this post:
>>>>>> http://www.open-mpi.org/community/lists/users/2015/01/26181.php
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> Link to this post:
>>>> http://www.open-mpi.org/community/lists/users/2015/01/26185.php
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2015/01/26188.php
>>
>