Hi,
We are currently using OpenMPI 1.3 on Ranger for jobs with large
processor counts (8K+). Our code occasionally appears to deadlock at
random within point-to-point communication (see the stack trace below).
This code has been tested against many different MPI implementations
and, as far as we know, it does not contain a deadlock. However, in the
past we have run into problems with shared-memory optimizations within
MPI causing deadlocks. We can usually avoid these by setting a few
environment variables to either increase the size of the shared-memory
buffers or disable the shared-memory optimizations altogether. Are there
any known issues in OpenMPI 1.3 that might be causing these hangs? If
so, are there any workarounds? Also, how do we disable shared memory
within OpenMPI?
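For the last point, is excluding the sm BTL via an MCA parameter the
right approach? Our guess (the exact parameter names below are only our
assumptions) is something along these lines:

    # run without the shared-memory BTL (on-node traffic falls back to the network BTLs)
    mpirun --mca btl ^sm ...

    # or the equivalent environment variable picked up at launch
    export OMPI_MCA_btl=^sm

    # and, if we only want larger shared-memory buffers instead,
    # perhaps something like this (parameter name is a guess on our part):
    mpirun --mca btl_sm_eager_limit 65536 ...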
Here is an example of where processors are hanging:
#0 0x00002b2df3522683 in mca_btl_sm_component_progress () from
/opt/apps/intel10_1/openmpi/1.3/lib/openmpi/mca_btl_sm.so
#1 0x00002b2df2cb46bf in mca_bml_r2_progress () from
/opt/apps/intel10_1/openmpi/1.3/lib/openmpi/mca_bml_r2.so
#2 0x00002b2df0032ea4 in opal_progress () from
/opt/apps/intel10_1/openmpi/1.3/lib/libopen-pal.so.0
#3 0x00002b2ded0d7622 in ompi_request_default_wait_some () from
/opt/apps/intel10_1/openmpi/1.3//lib/libmpi.so.0
#4 0x00002b2ded109e34 in PMPI_Waitsome () from
/opt/apps/intel10_1/openmpi/1.3//lib/libmpi.so.0
Thanks,
Justin