Hi,
We are currently using OpenMPI 1.3 on Ranger for jobs with large
processor counts (8K+). Our code occasionally appears to deadlock at
random within point-to-point communication (see the stack trace below).
This code has been tested against many different MPI implementations
and, as far as we know, it does not contain a deadlock. However, in the
past we have run into problems with shared-memory optimizations within
MPI causing deadlocks. We can usually avoid these by setting a few
environment variables to either increase the size of the shared-memory
buffers or disable the shared-memory optimizations altogether. Are there
any known issues in OpenMPI 1.3 that might be causing these hangs? If
so, are there any workarounds? Also, how do we disable shared memory
within OpenMPI?
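For the last point, is excluding the sm BTL via an MCA parameter the
right approach? Our guess (the exact parameter names below are only our
assumptions) is something along these lines:

    # run without the shared-memory BTL (on-node traffic falls back to the network BTLs)
    mpirun --mca btl ^sm ...

    # or the equivalent environment variable picked up at launch
    export OMPI_MCA_btl=^sm

    # and, if we only want larger shared-memory buffers instead,
    # perhaps something like this (parameter name is a guess on our part):
    mpirun --mca btl_sm_eager_limit 65536 ...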
Here is an example of where processors are hanging:
#0 0x00002b2df3522683 in mca_btl_sm_component_progress () from
/opt/apps/intel10_1/openmpi/1.3/lib/openmpi/mca_btl_sm.so
#1 0x00002b2df2cb46bf in mca_bml_r2_progress () from
/opt/apps/intel10_1/openmpi/1.3/lib/openmpi/mca_bml_r2.so
#2 0x00002b2df0032ea4 in opal_progress () from
/opt/apps/intel10_1/openmpi/1.3/lib/libopen-pal.so.0
#3 0x00002b2ded0d7622 in ompi_request_default_wait_some () from
/opt/apps/intel10_1/openmpi/1.3//lib/libmpi.so.0
#4 0x00002b2ded109e34 in PMPI_Waitsome () from
/opt/apps/intel10_1/openmpi/1.3//lib/libmpi.so.0
Thanks,
Justin