Hi Ben,

Would you mind checking whether you still observe this deadlock if you use
the 1.8.4rc4 release candidate?

openmpi-1.8.4rc4.tar.gz
<http://www.open-mpi.org/software/ompi/v1.8/downloads/openmpi-1.8.4rc4.tar.gz>

I realize the behavior will likely be the same, but this is just to double
check.

The Open MPI man page for MPI_Attr_get (hmm... there is no MPI_Comm_get_attr
man page, which also needs to be fixed) says nothing about recursion issues
when this function is invoked from within an attribute delete callback, so I
would treat this as a bug.
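
For reference, here is a minimal sketch (not PETSc's actual code, just an
illustration of the pattern) of an attribute delete callback that calls
MPI_Comm_get_attr on the same communicator. Given the single attribute lock
in the 1.8 series, this is where I would expect the hang to show up:

    /* repro.c - minimal sketch of re-entering the attribute code from a
     * delete callback; hypothetical example, not taken from PETSc. */
    #include <mpi.h>

    static int keyval = MPI_KEYVAL_INVALID;

    /* Delete callback: queries an attribute on the communicator that is
     * being torn down, i.e. calls back into the attribute machinery while
     * the delete path already holds the attribute lock. */
    static int del_fn(MPI_Comm comm, int kv, void *attr_val, void *extra)
    {
        void *val;
        int flag;
        MPI_Comm_get_attr(comm, kv, &val, &flag);  /* expected hang point */
        return MPI_SUCCESS;
    }

    int main(int argc, char **argv)
    {
        MPI_Comm dup;
        MPI_Init(&argc, &argv);
        MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, del_fn, &keyval, NULL);
        MPI_Comm_dup(MPI_COMM_WORLD, &dup);
        MPI_Comm_set_attr(dup, keyval, (void *)0x1);
        MPI_Comm_free(&dup);   /* runs del_fn -> MPI_Comm_get_attr */
        MPI_Finalize();
        return 0;
    }

Built with mpicc and run on a single rank, that should be enough to exercise
the same code path as the backtrace below.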

Thanks for your patience,

Howard


2014-12-17 17:07 GMT-07:00 Ben Menadue <ben.mena...@nci.org.au>:
>
> Hi PETSc and OpenMPI teams,
>
> I'm running into a deadlock in PETSc 3.4.5 with OpenMPI 1.8.3:
>
>  1. PetscCommDestroy calls MPI_Attr_delete
>  2. MPI_Attr_delete acquires a lock
>  3. MPI_Attr_delete calls Petsc_DelComm_Outer (through a callback)
>  4. Petsc_DelComm_Outer calls MPI_Attr_get
>  5. MPI_Attr_get then tries to acquire the same lock taken in step 2, so it deadlocks.
>
> Looking at the OpenMPI source code, it looks like you can't call an
> MPI_Attr_* function from inside the registered deletion callback. The
> OpenMPI source code notes that all of the functions acquire a global lock,
> which is where the problem is coming from - here are the comments and the
> lock definition, in ompi/attribute/attribute.c of OpenMPI 1.8.3:
>
>     404 /*
>     405  * We used to have multiple locks for semi-fine-grained locking.  But
>     406  * the code got complex, and we had to spend time looking for subtle
>     407  * bugs.  Craziness -- MPI attributes are *not* high performance, so
>     408  * just use a One Big Lock approach: there is *no* concurrent access.
>     409  * If you have the lock, you can do whatever you want and no data will
>     410  * change/disapear from underneath you.
>     411  */
>     412 static opal_mutex_t attribute_lock;
>
> To get it to work, I had to modify the definition of this lock to use a
> recursive mutex:
>
>     412 static opal_mutex_t attribute_lock = { .m_lock_pthread =
>                 PTHREAD_RECURSIVE_MUTEX_INITIALIZER_NP };
>
> but this is non-portable.
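
For what it's worth, a portable way to get the same effect is to select the
recursive type at runtime with pthread_mutexattr_settype() instead of the
_NP static initializer. A rough sketch, assuming the lock can be set up once
before first use (the init function name here is made up, and this uses a
plain pthread_mutex_t rather than opal_mutex_t):

    #include <pthread.h>

    static pthread_mutex_t attribute_lock;

    /* Hypothetical one-time init hook; in Open MPI this would need to run
     * wherever the attribute machinery is initialised. */
    static void attribute_lock_init(void)
    {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        /* PTHREAD_MUTEX_RECURSIVE is standard POSIX, unlike the _NP
         * static initializer. */
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
        pthread_mutex_init(&attribute_lock, &attr);
        pthread_mutexattr_destroy(&attr);
    }

That only makes the re-entry legal, though; it doesn't settle whether a
delete callback is allowed to call MPI_Attr_* at all.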
>
> Is this behaviour expected in newer versions of OpenMPI? If so, a new
> approach might be needed in PETSc. Otherwise, maybe a per-attribute lock is
> needed in OpenMPI - but I'm not sure whether the get in the callback is on
> the same attribute as the one being deleted.
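
To make that last caveat concrete, here is a toy sketch in plain pthreads
(not Open MPI's actual code) of what per-keyval locking would look like: the
delete path holds the lock of the keyval being deleted while the callback
runs, so a get on a different keyval goes through, but a get on the same
keyval would still self-deadlock unless that per-keyval lock were itself
recursive:

    #include <pthread.h>
    #include <stdio.h>

    #define NKEYS 4
    /* toy per-keyval state: one lock and one value slot per keyval */
    static pthread_mutex_t key_lock[NKEYS] = {
        PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
        PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
    };
    static void *key_val[NKEYS];

    static void toy_attr_get(int key, void **val)
    {
        pthread_mutex_lock(&key_lock[key]);   /* blocks forever if the
                                                 caller already holds it */
        *val = key_val[key];
        pthread_mutex_unlock(&key_lock[key]);
    }

    /* stands in for the registered delete callback */
    static void delete_callback(int deleted_key)
    {
        void *v;
        toy_attr_get((deleted_key + 1) % NKEYS, &v);  /* other keyval: OK */
        /* toy_attr_get(deleted_key, &v);   same keyval: self-deadlock */
    }

    static void toy_attr_delete(int key)
    {
        pthread_mutex_lock(&key_lock[key]);
        delete_callback(key);  /* callback runs with this keyval's lock held */
        key_val[key] = NULL;
        pthread_mutex_unlock(&key_lock[key]);
    }

    int main(void)
    {
        toy_attr_delete(0);
        puts("no deadlock: the callback queried a different keyval");
        return 0;
    }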
>
> Thanks,
> Ben
>
> #0  0x00007fd7d5de4264 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x00007fd7d5ddf508 in _L_lock_854 () from /lib64/libpthread.so.0
> #2  0x00007fd7d5ddf3d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x00007fd7d27d91bc in ompi_attr_get_c () from /apps/openmpi/1.8.3/lib/libmpi.so.1
> #4  0x00007fd7d2803f03 in PMPI_Attr_get () from /apps/openmpi/1.8.3/lib/libmpi.so.1
> #5  0x00007fd7d7716006 in Petsc_DelComm_Outer (comm=0x7fd7d2a83b30, keyval=128, attr_val=0x7fff00a20f00, extra_state=0xffffffffffffffff) at pinit.c:406
> #6  0x00007fd7d27d8cad in ompi_attr_delete_impl () from /apps/openmpi/1.8.3/lib/libmpi.so.1
> #7  0x00007fd7d27d8f2f in ompi_attr_delete () from /apps/openmpi/1.8.3/lib/libmpi.so.1
> #8  0x00007fd7d2803dfc in PMPI_Attr_delete () from /apps/openmpi/1.8.3/lib/libmpi.so.1
> #9  0x00007fd7d78bf5c5 in PetscCommDestroy (comm=0x7fd7d2a83b30) at tagm.c:256
> #10 0x00007fd7d7506f58 in PetscHeaderDestroy_Private (h=0x7fd7d2a83b30) at inherit.c:114
> #11 0x00007fd7d75038a0 in ISDestroy (is=0x7fd7d2a83b30) at index.c:225
> #12 0x00007fd7d75029b7 in PCReset_ILU (pc=0x7fd7d2a83b30) at ilu.c:42
> #13 0x00007fd7d77a9baa in PCReset (pc=0x7fd7d2a83b30) at precon.c:81
> #14 0x00007fd7d77a99ae in PCDestroy (pc=0x7fd7d2a83b30) at precon.c:117
> #15 0x00007fd7d7557c1a in KSPDestroy (ksp=0x7fd7d2a83b30) at itfunc.c:788
> #16 0x00007fd7d91cdcca in linearSystemPETSc<double>::~linearSystemPETSc (this=0x7fd7d2a83b30) at /short/z00/bjm900/build/fluidity/intel15-ompi183/gmsh-2.8.5-source/Solver/linearSystemPETSc.hpp:73
> #17 0x00007fd7d8ddb63b in GFaceCompound::parametrize (this=0x7fd7d2a83b30, step=128, tom=10620672) at /short/z00/bjm900/build/fluidity/intel15-ompi183/gmsh-2.8.5-source/Geo/GFaceCompound.cpp:1672
> #18 0x00007fd7d8dda0fe in GFaceCompound::parametrize (this=0x7fd7d2a83b30) at /short/z00/bjm900/build/fluidity/intel15-ompi183/gmsh-2.8.5-source/Geo/GFaceCompound.cpp:916
> #19 0x00007fd7d8f98b0e in checkMeshCompound (gf=0x7fd7d2a83b30, edges=...) at /short/z00/bjm900/build/fluidity/intel15-ompi183/gmsh-2.8.5-source/Mesh/meshGFace.cpp:2588
> #20 0x00007fd7d8f95c7e in meshGenerator (gf=0xd13020, RECUR_ITER=0, repairSelfIntersecting1dMesh=true, onlyInitialMesh=false, debug=false, replacement_edges=0x0) at /short/z00/bjm900/build/fluidity/intel15-ompi183/gmsh-2.8.5-source/Mesh/meshGFace.cpp:1075
> #21 0x00007fd7d8f9a41e in meshGFace::operator() (this=0x7fd7d2a83b30, gf=0x80, print=false) at /short/z00/bjm900/build/fluidity/intel15-ompi183/gmsh-2.8.5-source/Mesh/meshGFace.cpp:2562
> #22 0x00007fd7d8f8c327 in Mesh2D (m=0x7fd7d2a83b30) at /short/z00/bjm900/build/fluidity/intel15-ompi183/gmsh-2.8.5-source/Mesh/Generator.cpp:407
> #23 0x00007fd7d8f8ad0b in GenerateMesh (m=0x7fd7d2a83b30, ask=128) at /short/z00/bjm900/build/fluidity/intel15-ompi183/gmsh-2.8.5-source/Mesh/Generator.cpp:641
> #24 0x00007fd7d8e43126 in GModel::mesh (this=0x7fd7d2a83b30, dimension=128) at /short/z00/bjm900/build/fluidity/intel15-ompi183/gmsh-2.8.5-source/Geo/GModel.cpp:535
> #25 0x00007fd7d8c1acd2 in GmshBatch () at /short/z00/bjm900/build/fluidity/intel15-ompi183/gmsh-2.8.5-source/Common/Gmsh.cpp:240
> #26 0x000000000040187a in main (argc=-760726736, argv=0x80) at /short/z00/bjm900/build/fluidity/intel15-ompi183/gmsh-2.8.5-source/Common/Main.cpp:27
