Re: [OMPI users] 1.8.4 behaves completely different from 1.6.5

Thomas Klimpel Thu, 9 Apr 2015 16:26:54 -0400 (EDT)

I have now tried 1.8.5rc1. From my point of view, it behaves very similarly
to 1.8.4 (and completely differently from 1.6.5). The warning
[warn] opal_libevent2021_event_base_loop: reentrant invocation.  Only one
event_base_loop can run on each event_base at once.
is still there.


It's easy for me to (re)produce a deadlock with both 1.8.4 and 1.8.5rc1.
With 1.8.5rc1, I sometimes even get the deadlock without the warning. The
following seems crucial for reproducing the deadlock:

1) Start a worker on the same node as the master.
2) Chop big messages into 1k blocks. With 2k blocks, the deadlocks become
rarer, and with 4k blocks (or no chopping at all), the deadlocks seem to be
gone.
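
For anyone trying to reproduce this, the chopping in step 2 can be sketched
roughly as follows. This is not my actual application code, only a minimal
illustration. It assumes that "1k" means 1024 bytes, and the names
send_block_fn, block_count, send_in_blocks and count_bytes are made up for
this sketch. In a real reproducer the callback would wrap a blocking
MPI_Send to the worker; here it only counts bytes, so the sketch compiles
without MPI.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: sends one block, returns 0 on success.
 * In a real reproducer this would wrap a blocking MPI_Send. */
typedef int (*send_block_fn)(const char *data, size_t len, void *ctx);

/* Number of blocks needed to cover total_len bytes (rounds up). */
size_t block_count(size_t total_len, size_t block_size)
{
    assert(block_size > 0);
    return (total_len + block_size - 1) / block_size;
}

/* Send buf as consecutive blocks of at most block_size bytes each.
 * Returns the number of blocks sent, or (size_t)-1 if the callback fails. */
size_t send_in_blocks(const char *buf, size_t total_len, size_t block_size,
                      send_block_fn send_block, void *ctx)
{
    size_t sent = 0;

    assert(block_size > 0);
    for (size_t off = 0; off < total_len; off += block_size) {
        size_t remaining = total_len - off;
        size_t len = remaining < block_size ? remaining : block_size;

        if (send_block(buf + off, len, ctx) != 0)
            return (size_t)-1;
        ++sent;
    }
    return sent;
}

/* Stand-in callback (assumption): tallies bytes instead of sending them. */
int count_bytes(const char *data, size_t len, void *ctx)
{
    (void)data;
    *(size_t *)ctx += len;
    return 0;
}
```

Calling send_in_blocks with a block size of 1024, 2048 or 4096 would
correspond to the 1k, 2k and 4k cases described above.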

The deadlock happens even with a single worker:

#0  0x000000363f20e054 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x000000363f209388 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x000000363f209257 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f9901d47343 in mca_btl_vader_component_progress () from /homes/data/public/Development/3rdParty/install/openmpi-1.8.5rc1/Linux-x86_64-redhat.6.3/M64/lib/openmpi/mca_btl_vader.so
#4  0x00007f9910a9b49a in opal_progress () from /homes/data/public/Development/3rdParty/install/openmpi-1.8.5rc1/Linux-x86_64-redhat.6.3/M64/lib/libopen-pal.so.6
#5  0x00007f990170594d in mca_pml_ob1_send () from /homes/data/public/Development/3rdParty/install/openmpi-1.8.5rc1/Linux-x86_64-redhat.6.3/M64/lib/openmpi/mca_pml_ob1.so
