FWIW, we solved this problem with ROMIO in MPICH2 by making the "big global lock" a recursive mutex. In the past it was implicitly so because of the way that recursive MPI calls were handled. In current MPICH2 it's explicitly initialized with type PTHREAD_MUTEX_RECURSIVE instead.

-Dave

On Apr 4, 2011, at 9:28 AM CDT, Ralph Castain wrote:

>
> On Apr 4, 2011, at 8:18 AM, Rob Latham wrote:
>
>> On Sat, Apr 02, 2011 at 04:59:34PM -0400, fa...@email.com wrote:
>>>
>>> opal_mutex_lock(): Resource deadlock avoided
>>> #0 0x0012e416 in __kernel_vsyscall ()
>>> #1 0x01035941 in raise (sig=6) at
>>> ../nptl/sysdeps/unix/sysv/linux/raise.c:64
>>> #2 0x01038e42 in abort () at abort.c:92
>>> #3 0x00d9da68 in ompi_attr_free_keyval (type=COMM_ATTR, key=0xbffda0e4,
>>> predefined=0 '\000') at attribute/attribute.c:656
>>> #4 0x00dd8aa2 in PMPI_Keyval_free (keyval=0xbffda0e4) at pkeyval_free.c:52
>>> #5 0x01bf3e6a in ADIOI_End_call (comm=0xf1c0c0, keyval=10,
>>> attribute_val=0x0, extra_state=0x0) at ad_end.c:82
>>> #6 0x00da01bb in ompi_attr_delete. (type=UNUSED_ATTR, object=0x6,
>>> attr_hash=0x2c64, key=14285602, predefined=232 '\350', need_lock=128
>>> '\200') at attribute/attribute.c:726
>>> #7 0x00d9fb22 in ompi_attr_delete_all (type=COMM_ATTR, object=0xf1c0c0,
>>> attr_hash=0x8d0fee8) at attribute/attribute.c:1043
>>> #8 0x00dbda65 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:133
>>> #9 0x00dd12c2 in PMPI_Finalize () at pfinalize.c:46
>>> #10 0x00d6b515 in mpi_finalize_f (ierr=0xbffda2b8) at pfinalize_f.c:62
>>
>> I guess I need some OpenMPI eyeballs on this...
>>
>> ROMIO hooks into the attribute keyval deletion mechanism to clean up
>> the internal data structures it has allocated. I suppose since this
>> is MPI_Finalize, we could just leave those internal data structures
>> alone and let the OS deal with it.
>>
>> What I see happening here is the OpenMPI finalize routine is deleting
>> attributes. One of those attributes is ROMIO's, which in turn tries
>> to free keyvals. Is the deadlock that nothing "under" ompi_attr_delete
>> can itself call ompi_* routines? (as ROMIO triggers a call to
>> ompi_attr_free_keyval) ?
>>
>> Here's where ROMIO sets up the keyval and the delete handler:
>> https://trac.mcs.anl.gov/projects/mpich2/browser/mpich2/trunk/src/mpi/romio/mpi-io/mpir-mpioinit.c#L39
>>
>> that routine gets called upon any "MPI-IO entry point" (open, delete,
>> register-datarep). The keyvals help ensure that ROMIO's internal
>> structures get initialized exactly once, and the delete hooks help us
>> be good citizens and clean up on exit.
>
> FWIW: his trace shows that OMPI incorrectly attempts to acquire a thread lock
> that has already been locked. This occurs in OMPI's attribute code, probably
> surrounding the call to your code.
>
> In other words, it looks to me like the problem is on our side, not yours.
> Jeff is the one who generally handles the attribute code, though, so I'll
> ping his eyeballs :-)
>
>
>>
>> ==rob
>>
>> --
>> Rob Latham
>> Mathematics and Computer Science Division
>> Argonne National Lab, IL USA
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users