Ah, indeed - I found the problem. Fix coming momentarily.
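
For anyone reading along, here is a rough single-threaded sketch of the ordering Leonid describes in the quoted report. It is a toy model under stated assumptions, not libevent or OMPI code: toy_base, toy_loopbreak() and toy_loop_would_poll() are made-up names, and only the flag handling mentioned in this thread is modeled (no locking, no real polling).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model, NOT libevent code: only the flag handling
 * described in this thread is reproduced here. */
struct toy_base {
    bool event_break;  /* stands in for base->event_break */
};

/* Models a break request arriving before the loop has started:
 * the flag is set, and there is no running loop to wake up. */
static void toy_loopbreak(struct toy_base *base)
{
    assert(base != NULL);
    base->event_break = true;
}

/* Models loop entry. With clear_on_entry set, the flag is reset the way
 * the quoted statement resets it. Returns true if the loop would go on
 * to poll (in the real case, with no timeout), false if it would return
 * immediately because a break request is still pending. */
static bool toy_loop_would_poll(struct toy_base *base, bool clear_on_entry)
{
    assert(base != NULL);
    if (clear_on_entry) {
        base->event_break = false;  /* a pending request is lost here */
    }
    return !base->event_break;
}
```

Calling toy_loopbreak() and then toy_loop_would_poll(&base, true) returns true, which corresponds to the hang: the break request was recorded and then wiped before anyone looked at it. With clear_on_entry set to false the same sequence returns false. Whether skipping the reset is safe in the real library is a separate question.
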
> On Jan 15, 2015, at 10:31 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> Hmmm…I’m not seeing a failure. Let me try on another system.
>
>
> Modifying libevent is not a viable solution :-(
>
>
>> On Jan 15, 2015, at 10:26 AM, Leonid <lchis...@pathscale.com> wrote:
>>
>> Hi Ralph.
>>
>> Of course that may indicate an issue with the custom compiler, but given that
>> it also fails with gcc once a delay is inserted, I still think it is an OMPI
>> bug, since such a delay could be caused by the operating system at that exact
>> point.
>>
>> For me, simply commenting out "base->event_gotterm = base->event_break = 0;"
>> seems to do the trick, but I am not completely sure whether that will cause
>> other problems.
>>
>> I've tried to update my master branch to the latest version (including your
>> fix) but now it just crashes for me on *all* benchmarks that I am trying
>> (both with gcc and our compiler).
>>
>> On 15.01.2015 18:57, Ralph Castain wrote:
>>> Thought about this some more and realized that the orte progress engine
>>> wasn’t using the opal_progress_thread support functions, which include a
>>> “break” event to kick us out of just such problems. So I changed it on the
>>> master. From your citing of libevent 2.0.22, I believe that must be where
>>> you are working, yes?
>>>
>>> If so, give the changed version a try and see if your problem is resolved.
>>>
>>>
>>>> On Jan 15, 2015, at 12:55 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>> Given that you could only reproduce it with either your custom compiler or
>>>> by forcibly introducing a delay, is this indicating an issue with the
>>>> custom compiler? It does seem strange that we don't see this anywhere
>>>> else, given the number of times that code gets run.
>>>>
>>>> The only alternative solution I can think of would be to push the finalize
>>>> request into the event loop, and thus execute the loopbreak from within an
>>>> event. You might try that and see if it solves the problem.
>>>>
>>>>
>>>>> On Jan 14, 2015, at 11:54 PM, Leonid <lchis...@pathscale.com> wrote:
>>>>>
>>>>> Hi all.
>>>>>
>>>>> I believe there is a bug in the event_base_loop() function in event.c
>>>>> (opal/mca/event/libevent2022/libevent/).
>>>>>
>>>>> Consider the case where the application is about to be finalized and
>>>>> event_base_loop() and event_base_loopbreak() are called at the same time
>>>>> in parallel threads.
>>>>>
>>>>> If event_base_loopbreak() happens to acquire the lock first, it will set
>>>>> "event_base->event_break = 1", but won't send any signal to the event
>>>>> loop, because the loop has not started yet.
>>>>>
>>>>> After that, event_base_loop() will acquire the lock and clear the
>>>>> event_break flag with the following statement: "base->event_gotterm =
>>>>> base->event_break = 0;". Then it will go into polling with timeout = -1
>>>>> and thus block forever.
>>>>>
>>>>> This issue was reproduced with a custom compiler (using the Lulesh
>>>>> benchmark on a 4-core x86 PC), but I can also reproduce it with GCC
>>>>> (on almost any benchmark, with the same hardware) by adding a delay to
>>>>> the orte_progress_thread_engine() function:
>>>>>
>>>>> static void* orte_progress_thread_engine(opal_object_t *obj)
>>>>> {
>>>>>     while (orte_event_base_active) {
>>>>>         /* added sleep: gives orte_ess_base_app_finalize() time to set
>>>>>          * the orte_event_base_active flag to false */
>>>>>         usleep(1000);
>>>>>         opal_event_loop(orte_event_base, OPAL_EVLOOP_ONCE);
>>>>>     }
>>>>>     return OPAL_THREAD_CANCELLED;
>>>>> }
>>>>>
>>>>> I am not completely sure what the best fix for the described problem
>>>>> would be.
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>> Link to this post:
>>>>> http://www.open-mpi.org/community/lists/users/2015/01/26181.php
>