This is not fixed in the trunk. At this time MPI_THREAD_MULTIPLE will always
hang (though there may be some configurations that don't). The problem is that
when multiple threads are active, opal_condition_wait() ALWAYS blocks on a
condition variable instead of calling opal_progress(). Thus we will not make
progress, since opal_progress() will never be called in the
MPI_Waitall/MPI_Waitany paths. This will probably be fixed as we address other
threading issues over the next several months. It is unlikely this will happen
in time for 1.8.0.
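
Roughly, the difference between the two wait paths looks like this (a
simplified sketch only, NOT the actual opal source; progress_engine_poll()
and request_complete are just stand-in names):

    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static volatile bool   request_complete = false;

    /* Stand-in for opal_progress(): in the real library this polls the
     * transports and, as a side effect, completes outstanding requests
     * and signals any waiters. */
    static void progress_engine_poll(void)
    {
        /* poll network / shared memory, mark requests complete, ... */
    }

    /* Single-threaded path: the waiter drives progress itself, so the
     * completion it is waiting for can actually happen. */
    static void wait_single_threaded(void)
    {
        while (!request_complete)
            progress_engine_poll();
    }

    /* Threaded path (the problem): the waiter parks on a condition
     * variable and never touches the progress engine.  Unless some other
     * thread polls progress and signals cond, this blocks forever. */
    static void wait_multi_threaded(void)
    {
        pthread_mutex_lock(&lock);
        while (!request_complete)
            pthread_cond_wait(&cond, &lock);
        pthread_mutex_unlock(&lock);
    }

In the real code both behaviors sit behind opal_condition_wait(), so whether
thread support is active decides whether opal_progress() ever gets called.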

-Nathan Hjelm
Application Readiness, HPC-5, LANL

On Wed, Dec 04, 2013 at 12:13:25PM -0500, Dominique Orban wrote:
> I built the 1.7.x nightly tarball on 10.8 (Mountain Lion) and 10.9
> (Mavericks), and it still hangs. I tried compiling with
> --enable-mpi-thread-multiple only and with the other options Pierre
> mentioned. The PETSc tests hang in both cases.
> 
> I'm curious to know whether the nightly tarball fixes the issue for other users.
> 
> 
> On 2013-12-02, at 6:40 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> 
> > I'm joining this thread late, but I think I know what is going on:
> > 
> > - I am able to replicate the hang with 1.7.3 on Mavericks (with threading 
> > enabled, etc.)
> > - I notice that the hang has disappeared at the 1.7.x branch head (also on 
> > Mavericks)
> > 
> > Meaning: can you try with the latest 1.7.x nightly tarball and verify that 
> > the problem disappears for you?  See http://www.open-mpi.org/nightly/v1.7/
> > 
> > Ralph recently brought over a major ORTE control message change to the 
> > 1.7.x branch (after 1.7.3 was released) that -- skipping lots of details -- 
> > changes how the shared memory bootstrapping works.  Based on the stack 
> > traces you sent and the ones I was also able to get, I'm thinking that 
> > Ralph's big ORTE change fixes this issue.
> > 
> > 
> > 
> > On Nov 25, 2013, at 10:52 PM, Dominique Orban <dominique.or...@gmail.com> 
> > wrote:
> > 
> >> 
> >> On 2013-11-25, at 9:02 PM, Ralph Castain <rhc.open...@gmail.com> wrote:
> >> 
> >>> On Nov 25, 2013, at 5:04 PM, Pierre Jolivet <joli...@ann.jussieu.fr> 
> >>> wrote:
> >>> 
> >>>> 
> >>>> On Nov 24, 2013, at 3:03 PM, Jed Brown <jedbr...@mcs.anl.gov> wrote:
> >>>> 
> >>>>> Ralph Castain <r...@open-mpi.org> writes:
> >>>>> 
> >>>>>> Given that we have no idea what Homebrew uses, I don't know how we
> >>>>>> could clarify/respond.
> >>>>> 
> >>>> 
> >>>> Ralph, it is pretty easy to know what Homebrew uses; cf.
> >>>> https://github.com/mxcl/homebrew/blob/master/Library/Formula/open-mpi.rb
> >>>> (sorry if you meant something else).
> >>> 
> >>> Might be a surprise, but I don't track all these guys :-)
> >>> 
> >>> Homebrew is new to me
> >>> 
> >>>> 
> >>>>> Pierre provided a link to MacPorts saying that all of the following
> >>>>> options were needed to properly enable threads.
> >>>>> 
> >>>>> --enable-event-thread-support --enable-opal-multi-threads 
> >>>>> --enable-orte-progress-threads --enable-mpi-thread-multiple
> >>>>> 
> >>>>> If that is indeed the case, and if passing some subset of these options
> >>>>> results in deadlock, it's not exactly user-friendly.
> >>>>> 
> >>>>> Maybe --enable-mpi-thread-multiple is enough, in which case MacPorts is
> >>>>> doing something needlessly complicated and Pierre's link was a red
> >>>>> herring?
> >>>> 
> >>>> That is very likely, though on the other hand, Homebrew is doing 
> >>>> something pretty straightforward. I just wanted a quick and easy fix 
> >>>> back when I had the same hanging issue, but there should be a better 
> >>>> explanation if --enable-mpi-thread-multiple is indeed enough.
> >>> 
> >>> It is enough; we set all required things internally.
> >> 
> >> Is that for sure? My original message originated from a hang in the PETSc
> >> tests, and I get quite different results depending on whether or not I
> >> compile Open MPI with --enable-mpi-thread-multiple only.
> >> 
> >> I recompiled PETSc with debugging enabled against Open MPI built with the
> >> "correct" flags mentioned by Pierre, and this is the stack trace I get:
> >> 
> >> $ mpirun -n 2 xterm -e gdb ./ex5
> >> 
> >>    ^C
> >>    Program received signal SIGINT, Interrupt.
> >>    0x00007fff991160fa in __psynch_cvwait ()
> >>       from /usr/lib/system/libsystem_kernel.dylib
> >>    (gdb) where
> >>    #0  0x00007fff991160fa in __psynch_cvwait ()
> >>       from /usr/lib/system/libsystem_kernel.dylib
> >>    #1  0x00007fff98d6ffb9 in ?? () from /usr/lib/system/libsystem_c.dylib
> >> 
> >> 
> >>    ^C
> >>    Program received signal SIGINT, Interrupt.
> >>    0x00007fff991160fa in __psynch_cvwait ()
> >>       from /usr/lib/system/libsystem_kernel.dylib
> >>    (gdb) where
> >>    #0  0x00007fff991160fa in __psynch_cvwait ()
> >>       from /usr/lib/system/libsystem_kernel.dylib
> >>    #1  0x00007fff98d6ffb9 in ?? () from /usr/lib/system/libsystem_c.dylib
> >> 
> >> 
> >> If I recompile PETSc against Open MPI built with
> >> --enable-mpi-thread-multiple only (leaving out the other flags, which
> >> Pierre suggested is wrong), I get the following traces:
> >> 
> >>    ^C
> >>    Program received signal SIGINT, Interrupt.
> >>    0x00007fff991160fa in __psynch_cvwait ()
> >>       from /usr/lib/system/libsystem_kernel.dylib
> >>    (gdb) where
> >>    #0  0x00007fff991160fa in __psynch_cvwait ()
> >>       from /usr/lib/system/libsystem_kernel.dylib
> >>    #1  0x00007fff98d6ffb9 in ?? () from /usr/lib/system/libsystem_c.dylib
> >> 
> >> 
> >>    ^C
> >>    Program received signal SIGINT, Interrupt.
> >>    0x0000000101edca28 in mca_common_sm_init ()
> >>       from /usr/local/Cellar/open-mpi/1.7.3/lib/libmca_common_sm.4.dylib
> >>    (gdb) where
> >>    #0  0x0000000101edca28 in mca_common_sm_init ()
> >>       from /usr/local/Cellar/open-mpi/1.7.3/lib/libmca_common_sm.4.dylib
> >>    #1  0x0000000101ed8a38 in mca_mpool_sm_init ()
> >>       from /usr/local/Cellar/open-mpi/1.7.3/lib/openmpi/mca_mpool_sm.so
> >>    #2  0x0000000101c383fa in mca_mpool_base_module_create ()
> >>       from /usr/local/lib/libmpi.1.dylib
> >>    #3  0x0000000102933b41 in mca_btl_sm_add_procs ()
> >>       from /usr/local/Cellar/open-mpi/1.7.3/lib/openmpi/mca_btl_sm.so
> >>    #4  0x0000000102929dfb in mca_bml_r2_add_procs ()
> >>       from /usr/local/Cellar/open-mpi/1.7.3/lib/openmpi/mca_bml_r2.so
> >>    #5  0x000000010290a59c in mca_pml_ob1_add_procs ()
> >>       from /usr/local/Cellar/open-mpi/1.7.3/lib/openmpi/mca_pml_ob1.so
> >>    #6  0x0000000101bd859b in ompi_mpi_init ()
> >>       from /usr/local/lib/libmpi.1.dylib
> >>    #7  0x0000000101bf24da in MPI_Init_thread ()
> >>       from /usr/local/lib/libmpi.1.dylib
> >>    #8  0x00000001000724db in PetscInitialize (argc=0x7fff5fbfed48, 
> >>        args=0x7fff5fbfed40, file=0x0, 
> >>        help=0x1000061c0 "Bratu nonlinear PDE in 2d.\nWe solve the  Bratu 
> >> (SFI - solid fuel ignition) problem in a 2D rectangular\ndomain, using 
> >> distributed arrays(DMDAs) to partition the parallel grid.\nThe command 
> >> line options"...)
> >>        at /tmp/petsc-3.4.3/src/sys/objects/pinit.c:675
> >>    #9  0x0000000100000d8c in main ()
> >> 
> >> 
> >> Line 675 of pinit.c is
> >> 
> >>    ierr = MPI_Init_thread(argc,args,MPI_THREAD_FUNNELED,&provided);CHKERRQ(ierr);
> >> 
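> >> For reference, a minimal standalone check of the thread level the library
> >> actually grants (independent of PETSc; check_threads.c is just an example
> >> name) would be something like:
> >> 
> >>    /* check_threads.c: ask for MPI_THREAD_MULTIPLE, report what we get. */
> >>    #include <mpi.h>
> >>    #include <stdio.h>
> >> 
> >>    int main(int argc, char **argv)
> >>    {
> >>        int provided;
> >>        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
> >>        printf("requested %d (MPI_THREAD_MULTIPLE), provided %d\n",
> >>               MPI_THREAD_MULTIPLE, provided);
> >>        MPI_Finalize();
> >>        return 0;
> >>    }
> >> 
> >> MPI_Init_thread does not fail when a level is unavailable; it just returns
> >> a lower value in provided, so this shows directly what the build supports.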
> >> 
> >> Dominique
> >> 
> > 
> > 
> > -- 
> > Jeff Squyres
> > jsquy...@cisco.com
> 
> Dominique
> 