FWIW, we solved this problem with ROMIO in MPICH2 by making the "big global 
lock" a recursive mutex.  In the past it was implicitly so because of the way 
that recursive MPI calls were handled.  In current MPICH2 it's explicitly 
initialized with type PTHREAD_MUTEX_RECURSIVE instead.

-Dave

On Apr 4, 2011, at 9:28 AM CDT, Ralph Castain wrote:

> 
> On Apr 4, 2011, at 8:18 AM, Rob Latham wrote:
> 
>> On Sat, Apr 02, 2011 at 04:59:34PM -0400, fa...@email.com wrote:
>>> 
>>> opal_mutex_lock(): Resource deadlock avoided
>>> #0  0x0012e416 in __kernel_vsyscall ()
>>> #1  0x01035941 in raise (sig=6) at 
>>> ../nptl/sysdeps/unix/sysv/linux/raise.c:64
>>> #2  0x01038e42 in abort () at abort.c:92
>>> #3  0x00d9da68 in ompi_attr_free_keyval (type=COMM_ATTR, key=0xbffda0e4, 
>>> predefined=0 '\000') at attribute/attribute.c:656
>>> #4  0x00dd8aa2 in PMPI_Keyval_free (keyval=0xbffda0e4) at pkeyval_free.c:52
>>> #5  0x01bf3e6a in ADIOI_End_call (comm=0xf1c0c0, keyval=10, 
>>> attribute_val=0x0, extra_state=0x0) at ad_end.c:82
>>> #6  0x00da01bb in ompi_attr_delete. (type=UNUSED_ATTR, object=0x6, 
>>> attr_hash=0x2c64, key=14285602, predefined=232 '\350', need_lock=128 
>>> '\200') at attribute/attribute.c:726
>>> #7  0x00d9fb22 in ompi_attr_delete_all (type=COMM_ATTR, object=0xf1c0c0, 
>>> attr_hash=0x8d0fee8) at attribute/attribute.c:1043
>>> #8  0x00dbda65 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:133
>>> #9  0x00dd12c2 in PMPI_Finalize () at pfinalize.c:46
>>> #10 0x00d6b515 in mpi_finalize_f (ierr=0xbffda2b8) at pfinalize_f.c:62
>> 
>> I guess I need some OpenMPI eyeballs on this...
>> 
>> ROMIO hooks into the attribute keyval deletion mechanism to clean up
>> the internal data structures it has allocated.  I suppose since this
>> is MPI_Finalize, we could just leave those internal data structures
>> alone and let the OS deal with it. 
>> 
>> What I see happening here is the OpenMPI finalize routine is deleting
>> attributes.  One of those attributes is ROMIO's, which in turn tries
>> to free keyvals.  Is the deadlock that nothing "under" ompi_attr_delete
>> can itself call ompi_* routines (as ROMIO triggers a call to
>> ompi_attr_free_keyval)?
>> 
>> Here's where ROMIO sets up the keyval and the delete handler:
>> https://trac.mcs.anl.gov/projects/mpich2/browser/mpich2/trunk/src/mpi/romio/mpi-io/mpir-mpioinit.c#L39
>> 
>> That routine gets called upon any "MPI-IO entry point" (open, delete,
>> register-datarep).  The keyvals help ensure that ROMIO's internal
>> structures get initialized exactly once, and the delete hooks help us
>> be good citizens and clean up on exit. 
> 
> FWIW: his trace shows that OMPI incorrectly attempts to acquire a thread lock 
> that it has already locked. This occurs in OMPI's attribute code, probably 
> surrounding the call to your code.
> 
> In other words, it looks to me like the problem is on our side, not yours. 
> Jeff is the one who generally handles the attribute code, though, so I'll 
> ping his eyeballs :-)
> 
> 
>> 
>> ==rob
>> 
>> -- 
>> Rob Latham
>> Mathematics and Computer Science Division
>> Argonne National Lab, IL USA
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 

