Fixed - sorry about that!

> On Jan 15, 2015, at 10:39 AM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> Ah, indeed - I found the problem. Fix coming momentarily
> 
>> On Jan 15, 2015, at 10:31 AM, Ralph Castain <r...@open-mpi.org> wrote:
>> 
>> Hmmm…I’m not seeing a failure. Let me try on another system.
>> 
>> 
>> Modifying libevent is not a viable solution :-(
>> 
>> 
>>> On Jan 15, 2015, at 10:26 AM, Leonid <lchis...@pathscale.com> wrote:
>>> 
>>> Hi Ralph.
>>> 
>>> Of course that may indicate an issue with the custom compiler, but given
>>> that it also fails with gcc and an inserted delay, I still think it is an
>>> OMPI bug, since the operating system could cause such a delay at that
>>> exact point.
>>> 
>>> For me, simply commenting out "base->event_gotterm = base->event_break = 0;"
>>> seems to do the trick, but I am not completely sure whether that would
>>> cause other problems.
>>> 
>>> I've tried to update my master branch to the latest version (including your 
>>> fix) but now it just crashes for me on *all* benchmarks that I am trying 
>>> (both with gcc and our compiler).
>>> 
>>> On 15.01.2015 18:57, Ralph Castain wrote:
>>>> Thought about this some more and realized that the orte progress engine 
>>>> wasn’t using the opal_progress_thread support functions, which include a 
>>>> “break” event to kick us out of just such problems. So I changed it on the 
>>>> master. From your citing of libevent 2.0.22, I believe that must be where 
>>>> you are working, yes?
>>>> 
>>>> If so, give the changed version a try and see if your problem is resolved.
>>>> 
>>>> 
>>>>> On Jan 15, 2015, at 12:55 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>> 
>>>>> Given that you could only reproduce it with either your custom compiler
>>>>> or by forcibly introducing a delay, does this indicate an issue with the
>>>>> custom compiler? It does seem strange that we don't see this anywhere
>>>>> else, given the number of times that code gets run.
>>>>> 
>>>>> The only alternative solution I can think of would be to push the
>>>>> finalize request into the event loop, and thus execute the loopbreak from
>>>>> within an event. You might try that and see if it solves the problem.
>>>>> 
>>>>> 
>>>>>> On Jan 14, 2015, at 11:54 PM, Leonid <lchis...@pathscale.com> wrote:
>>>>>> 
>>>>>> Hi all.
>>>>>> 
>>>>>> I believe there is a bug in the event_base_loop() function in event.c
>>>>>> (opal/mca/event/libevent2022/libevent/).
>>>>>> 
>>>>>> Consider the case where the application is being finalized and
>>>>>> event_base_loop() and event_base_loopbreak() are called at the same time
>>>>>> from parallel threads.
>>>>>> 
>>>>>> Then, if event_base_loopbreak() happens to acquire the lock first, it
>>>>>> will set "event_base->event_break = 1" but won't send any signal to the
>>>>>> event loop, because the loop has not started yet.
>>>>>> 
>>>>>> After that, event_base_loop() will acquire the lock and clear the
>>>>>> event_break flag with the following statement: "base->event_gotterm =
>>>>>> base->event_break = 0;". It will then go into polling with timeout = -1
>>>>>> and thus block forever.
>>>>>> 
>>>>>> This issue was reproduced with a custom compiler (using the LULESH
>>>>>> benchmark on a 4-core x86 PC), but I can also reproduce it with GCC
>>>>>> (on almost any benchmark, with the same hardware) by adding a delay to
>>>>>> the orte_progress_thread_engine() function:
>>>>>> 
>>>>>> static void* orte_progress_thread_engine(opal_object_t *obj)
>>>>>> {
>>>>>>     while (orte_event_base_active) {
>>>>>>         /* added sleep: allows orte_ess_base_app_finalize() to set the
>>>>>>          * orte_event_base_active flag to false */
>>>>>>         usleep(1000);
>>>>>>         opal_event_loop(orte_event_base, OPAL_EVLOOP_ONCE);
>>>>>>     }
>>>>>>     return OPAL_THREAD_CANCELLED;
>>>>>> }
>>>>>> 
>>>>>> I am not completely sure what the best fix for the described problem
>>>>>> would be.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> us...@open-mpi.org
>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>> Link to this post: 
>>>>>> http://www.open-mpi.org/community/lists/users/2015/01/26181.php
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> Link to this post: 
>>>> http://www.open-mpi.org/community/lists/users/2015/01/26185.php
>>> 
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/users/2015/01/26188.php
>> 
> 
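
For illustration, the interleaving described in the original report (the break request takes the lock first, then the loop clears the flag on entry) can be replayed in a small single-threaded C model. This is only a sketch under stated assumptions: every name prefixed with toy_ is hypothetical and merely stands in for libevent, it is not the libevent API, and it is not the change that went into master. The model also covers the two ideas raised in the thread: leaving the flag uncleared on entry, and queuing the break request so that it runs from within the loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for libevent's event_base; only what the model needs. */
struct toy_base {
    bool event_break;     /* set by a break request */
    bool event_gotterm;   /* set by an exit request (never set in this model) */
    bool clear_on_entry;  /* true: loop entry clears both flags, as in the
                             statement quoted in the report */
    void (*pending)(struct toy_base *);  /* at most one queued callback */
};

/* Models a break request issued before the loop is running:
 * the flag is set, and there is no running loop to wake up. */
void toy_loopbreak(struct toy_base *b)
{
    assert(b != NULL);
    b->event_break = true;
}

/* Models pushing the break request into the loop as an event instead
 * of setting the flag directly from the finalizing thread. */
void toy_queue_break(struct toy_base *b)
{
    assert(b != NULL);
    b->pending = toy_loopbreak;
}

/* Models one entry into the loop. Returns true if the loop would go on
 * to poll with no timeout (block), false if it would return. */
bool toy_loop_would_block(struct toy_base *b)
{
    assert(b != NULL);

    /* Entry step: optionally wipe any break request made earlier. */
    if (b->clear_on_entry)
        b->event_gotterm = b->event_break = false;

    /* A queued callback survives the entry step and runs inside the loop. */
    if (b->pending != NULL) {
        void (*cb)(struct toy_base *) = b->pending;
        b->pending = NULL;
        cb(b);
    }

    return !(b->event_break || b->event_gotterm);
}
```

In this model, calling toy_loopbreak() and then toy_loop_would_block() on a base with clear_on_entry set reports that the loop would block, which is the hang from the report. With clear_on_entry unset (the "comment out the statement" workaround), or with toy_queue_break() used in place of toy_loopbreak(), the loop returns. The model says nothing about other side effects of either approach in real libevent.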
