Hi Ben,

Would you mind checking whether you still observe this deadlock condition with the 1.8.4rc4 candidate?

openmpi-1.8.4rc4.tar.gz <http://www.open-mpi.org/software/ompi/v1.8/downloads/openmpi-1.8.4rc4.tar.gz>

I realize the behavior will likely be the same, but this is just to double-check. The Open MPI man page for MPI_Attr_get (hmm... there is no MPI_Comm_get_attr man page; that also needs to be fixed) says nothing about recursion issues when invoking this function from within an attribute delete callback, so I would treat this as a bug.

Thanks for your patience,

Howard

2014-12-17 17:07 GMT-07:00 Ben Menadue <ben.mena...@nci.org.au>:
>
> Hi PETSc and OpenMPI teams,
>
> I'm running into a deadlock in PETSc 3.4.5 with OpenMPI 1.8.3:
>
> 1. PetscCommDestroy calls MPI_Attr_delete.
> 2. MPI_Attr_delete acquires a lock.
> 3. MPI_Attr_delete calls Petsc_DelComm_Outer (through a callback).
> 4. Petsc_DelComm_Outer calls MPI_Attr_get.
> 5. MPI_Attr_get wants to acquire the same lock as in step 2.
>
> Looking at the OpenMPI source code, it looks like you can't call an MPI_Attr_* function from inside a registered deletion callback. The OpenMPI source notes that all of these functions acquire a single global lock, which is where the problem comes from. Here are the comments and the lock definition, in ompi/attribute/attribute.c of OpenMPI 1.8.3:
>
> 404 /*
> 405  * We used to have multiple locks for semi-fine-grained locking.  But
> 406  * the code got complex, and we had to spend time looking for subtle
> 407  * bugs.  Craziness -- MPI attributes are *not* high performance, so
> 408  * just use a One Big Lock approach: there is *no* concurrent access.
> 409  * If you have the lock, you can do whatever you want and no data will
> 410  * change/disapear from underneath you.
> 411  */
> 412 static opal_mutex_t attribute_lock;
>
> To get it to work, I had to modify the definition of this lock to use a recursive mutex:
>
> 412 static opal_mutex_t attribute_lock = { .m_lock_pthread = PTHREAD_RECURSIVE_MUTEX_INITIALIZER_NP };
>
> but this is non-portable.
>
> Is this behaviour expected in newer versions of OpenMPI? If so, a new approach might be needed in PETSc. Otherwise, maybe a per-attribute lock is needed in OpenMPI - but I'm not sure whether the get in the callback is on the same attribute as the one being deleted.
>
> Thanks,
> Ben
>
> #0  0x00007fd7d5de4264 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x00007fd7d5ddf508 in _L_lock_854 () from /lib64/libpthread.so.0
> #2  0x00007fd7d5ddf3d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x00007fd7d27d91bc in ompi_attr_get_c () from /apps/openmpi/1.8.3/lib/libmpi.so.1
> #4  0x00007fd7d2803f03 in PMPI_Attr_get () from /apps/openmpi/1.8.3/lib/libmpi.so.1
> #5  0x00007fd7d7716006 in Petsc_DelComm_Outer (comm=0x7fd7d2a83b30, keyval=128, attr_val=0x7fff00a20f00, extra_state=0xffffffffffffffff) at pinit.c:406
> #6  0x00007fd7d27d8cad in ompi_attr_delete_impl () from /apps/openmpi/1.8.3/lib/libmpi.so.1
> #7  0x00007fd7d27d8f2f in ompi_attr_delete () from /apps/openmpi/1.8.3/lib/libmpi.so.1
> #8  0x00007fd7d2803dfc in PMPI_Attr_delete () from /apps/openmpi/1.8.3/lib/libmpi.so.1
> #9  0x00007fd7d78bf5c5 in PetscCommDestroy (comm=0x7fd7d2a83b30) at tagm.c:256
> #10 0x00007fd7d7506f58 in PetscHeaderDestroy_Private (h=0x7fd7d2a83b30) at inherit.c:114
> #11 0x00007fd7d75038a0 in ISDestroy (is=0x7fd7d2a83b30) at index.c:225
> #12 0x00007fd7d75029b7 in PCReset_ILU (pc=0x7fd7d2a83b30) at ilu.c:42
> #13 0x00007fd7d77a9baa in PCReset (pc=0x7fd7d2a83b30) at precon.c:81
> #14 0x00007fd7d77a99ae in PCDestroy (pc=0x7fd7d2a83b30) at precon.c:117
> #15 0x00007fd7d7557c1a in KSPDestroy (ksp=0x7fd7d2a83b30) at itfunc.c:788
> #16 0x00007fd7d91cdcca in linearSystemPETSc<double>::~linearSystemPETSc (this=0x7fd7d2a83b30) at /short/z00/bjm900/build/fluidity/intel15-ompi183/gmsh-2.8.5-source/Solver/linearSystemPETSc.hpp:73
> #17 0x00007fd7d8ddb63b in GFaceCompound::parametrize (this=0x7fd7d2a83b30, step=128, tom=10620672) at /short/z00/bjm900/build/fluidity/intel15-ompi183/gmsh-2.8.5-source/Geo/GFaceCompound.cpp:1672
> #18 0x00007fd7d8dda0fe in GFaceCompound::parametrize (this=0x7fd7d2a83b30) at /short/z00/bjm900/build/fluidity/intel15-ompi183/gmsh-2.8.5-source/Geo/GFaceCompound.cpp:916
> #19 0x00007fd7d8f98b0e in checkMeshCompound (gf=0x7fd7d2a83b30, edges=...) at /short/z00/bjm900/build/fluidity/intel15-ompi183/gmsh-2.8.5-source/Mesh/meshGFace.cpp:2588
> #20 0x00007fd7d8f95c7e in meshGenerator (gf=0xd13020, RECUR_ITER=0, repairSelfIntersecting1dMesh=true, onlyInitialMesh=false, debug=false, replacement_edges=0x0) at /short/z00/bjm900/build/fluidity/intel15-ompi183/gmsh-2.8.5-source/Mesh/meshGFace.cpp:1075
> #21 0x00007fd7d8f9a41e in meshGFace::operator() (this=0x7fd7d2a83b30, gf=0x80, print=false) at /short/z00/bjm900/build/fluidity/intel15-ompi183/gmsh-2.8.5-source/Mesh/meshGFace.cpp:2562
> #22 0x00007fd7d8f8c327 in Mesh2D (m=0x7fd7d2a83b30) at /short/z00/bjm900/build/fluidity/intel15-ompi183/gmsh-2.8.5-source/Mesh/Generator.cpp:407
> #23 0x00007fd7d8f8ad0b in GenerateMesh (m=0x7fd7d2a83b30, ask=128) at /short/z00/bjm900/build/fluidity/intel15-ompi183/gmsh-2.8.5-source/Mesh/Generator.cpp:641
> #24 0x00007fd7d8e43126 in GModel::mesh (this=0x7fd7d2a83b30, dimension=128) at /short/z00/bjm900/build/fluidity/intel15-ompi183/gmsh-2.8.5-source/Geo/GModel.cpp:535
> #25 0x00007fd7d8c1acd2 in GmshBatch () at /short/z00/bjm900/build/fluidity/intel15-ompi183/gmsh-2.8.5-source/Common/Gmsh.cpp:240
> #26 0x000000000040187a in main (argc=-760726736, argv=0x80) at /short/z00/bjm900/build/fluidity/intel15-ompi183/gmsh-2.8.5-source/Common/Main.cpp:27
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2014/12/26018.php
>
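For anyone who wants to poke at this without building PETSc, the calling pattern Ben describes can be reduced to something like the sketch below. It is illustrative only and is not PETSc's code: the keyval names, the callback, and the use of MPI_COMM_SELF are made up for the example. It only relies on the standard keyval API (MPI_Comm_create_keyval / MPI_Comm_set_attr / MPI_Comm_delete_attr / MPI_Comm_get_attr) and has the same shape as the trace above: a delete callback that calls an attribute-get function on the same communicator while the delete path already holds the attribute lock.

/* Minimal sketch (not PETSc's actual code) of the pattern that deadlocks
 * on Open MPI 1.8.3: an attribute delete callback that itself calls an
 * attribute-get function.  Build with mpicc, run on one rank. */
#include <mpi.h>
#include <stdio.h>

static int outer_keyval = MPI_KEYVAL_INVALID;
static int inner_keyval = MPI_KEYVAL_INVALID;

/* Stand-in for Petsc_DelComm_Outer: queries another attribute from inside
 * the delete callback, re-entering the attribute code while the One Big
 * Lock is already held by the delete path. */
static int delete_outer(MPI_Comm comm, int keyval, void *attr_val,
                        void *extra_state)
{
    void *inner = NULL;
    int   found = 0;
    (void)keyval; (void)attr_val; (void)extra_state;
    MPI_Comm_get_attr(comm, inner_keyval, &inner, &found); /* hangs here on 1.8.3 */
    return MPI_SUCCESS;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, delete_outer,
                           &outer_keyval, NULL);
    MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, MPI_COMM_NULL_DELETE_FN,
                           &inner_keyval, NULL);

    MPI_Comm_set_attr(MPI_COMM_SELF, outer_keyval, NULL);
    MPI_Comm_set_attr(MPI_COMM_SELF, inner_keyval, NULL);

    /* Analogue of PetscCommDestroy's MPI_Attr_delete call: this invokes
     * delete_outer() with the attribute lock held. */
    MPI_Comm_delete_attr(MPI_COMM_SELF, outer_keyval);

    printf("no deadlock\n");
    MPI_Finalize();
    return 0;
}

Based on the analysis above, this should hang inside MPI_Comm_delete_attr on 1.8.3 for the same reason as Ben's trace: ompi_attr_delete() holds attribute_lock when it invokes the callback, and the get call tries to take the same lock again.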
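On the recursive-mutex workaround: the PTHREAD_RECURSIVE_MUTEX_INITIALIZER_NP static initializer is indeed non-portable, but the recursive mutex type itself is standard POSIX if the mutex is initialized at runtime. Below is a generic sketch of that portable initialization. It is not a patch against Open MPI (opal_mutex_t wraps more than a bare pthread_mutex_t, and the runtime init would have to happen during attribute-subsystem setup); it only illustrates the mechanism.

/* Portable alternative to the *_NP static initializer: create a
 * recursive mutex at runtime with pthread_mutexattr_settype().
 * Build with -pthread. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t attribute_lock;

static void init_attribute_lock(void)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    /* PTHREAD_MUTEX_RECURSIVE is standard POSIX. */
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&attribute_lock, &attr);
    pthread_mutexattr_destroy(&attr);
}

int main(void)
{
    init_attribute_lock();

    /* A recursive mutex tolerates the nested acquire that the
     * delete-callback path performs. */
    pthread_mutex_lock(&attribute_lock);   /* outer: the delete path      */
    pthread_mutex_lock(&attribute_lock);   /* inner: the get-in-callback  */
    pthread_mutex_unlock(&attribute_lock);
    pthread_mutex_unlock(&attribute_lock);

    printf("nested lock/unlock succeeded\n");
    return 0;
}

Whether a recursive lock or the per-attribute locking Ben mentions is the better fix is, of course, a separate design question.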