Thank you for this info. I should add that our code tends to post a lot
of sends before the other side posts its receives, which causes a lot of
unexpected messages to exist. Our code explicitly matches up all tags
and processors (that is, we do not use MPI wildcards). If we had a
deadlock I would think we would see it regardless of whether or not we
cross the rendezvous threshold. I guess one way to test this would be to
set this threshold to 0; if it then deadlocks, we would likely be able
to track the deadlock down. Are there any other parameters we can pass
to MPI that will turn off buffering?
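
For reference, here is a stripped-down sketch of the pattern I am
describing (hypothetical buffer and loop names, not our actual code).
The test I have in mind would be something like running with
mpirun --mca btl_sm_eager_limit 0, assuming the sm BTL accepts 0 rather
than clamping it to some minimum:

/* Sketch of the send-ahead pattern described above (hypothetical names,
 * not our actual code).  Rank 0 posts many MPI_Isend calls before rank 1
 * posts any matching MPI_Irecv, so the messages arrive "unexpected" on
 * rank 1.  Below the eager limit the data is buffered on arrival; above
 * it the sends do not complete until the matching receives are posted
 * (rendezvous).  Run with at least 2 ranks. */
#include <mpi.h>
#include <stdlib.h>

#define NMSG 64
#define LEN  8192   /* 8192 doubles = 64 KB, above the 4096-byte sm eager limit */

int main(int argc, char **argv)
{
    int rank;
    double *buf[NMSG];
    MPI_Request req[NMSG];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < NMSG; i++)
        buf[i] = malloc(LEN * sizeof(double));

    if (rank == 0) {
        /* post all sends up front, with explicit destination and tags */
        for (int i = 0; i < NMSG; i++)
            MPI_Isend(buf[i], LEN, MPI_DOUBLE, 1, i, MPI_COMM_WORLD, &req[i]);
    } else if (rank == 1) {
        /* the matching receives are posted later (after other work), so
         * the incoming messages are unexpected when they arrive */
        for (int i = 0; i < NMSG; i++)
            MPI_Irecv(buf[i], LEN, MPI_DOUBLE, 0, i, MPI_COMM_WORLD, &req[i]);
    }

    if (rank < 2)
        MPI_Waitall(NMSG, req, MPI_STATUSES_IGNORE);

    for (int i = 0; i < NMSG; i++)
        free(buf[i]);
    MPI_Finalize();
    return 0;
}

Since every send has a matching receive with an explicit rank and tag,
this should complete under either protocol; the question is whether the
library itself can hang under a heavy unexpected-message load.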
Thanks,
Justin
Brock Palen wrote:
Whenever this happens we have found the code to have a deadlock. Users
never saw it until they crossed the eager->rendezvous threshold.
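
The classic shape of such a hidden deadlock looks something like this
(a minimal sketch, not your code): both ranks send before they receive,
which works as long as the messages are small enough to complete
eagerly, and hangs once they cross the rendezvous threshold.

/* Minimal sketch of a deadlock that stays hidden below the eager limit.
 * Both ranks call a blocking MPI_Send before posting their receive.
 * Small messages complete eagerly out of library buffers, so the code
 * appears to work; once the message size crosses the eager->rendezvous
 * threshold, each send waits for a receive that is never posted and
 * both ranks hang.  Run with exactly 2 ranks. */
#include <mpi.h>

#define LEN 100000   /* 100000 doubles = 800 KB, well above the 4096-byte eager limit */

int main(int argc, char **argv)
{
    int rank, peer;
    static double sendbuf[LEN], recvbuf[LEN];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = (rank == 0) ? 1 : 0;

    /* may block forever under the rendezvous protocol */
    MPI_Send(sendbuf, LEN, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, LEN, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}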
Yes you can disable shared memory with:
mpirun --mca btl ^sm
Or you can try increasing the eager limit.
ompi_info --param btl sm
MCA btl: parameter "btl_sm_eager_limit" (current value: "4096")
You can modify this limit at run time; I think (I can't test it right
now) it is just:
mpirun --mca btl_sm_eager_limit 40960
I think that when tweaking these values you can also use environment
variables in place of putting it all on the mpirun line:
export OMPI_MCA_btl_sm_eager_limit=40960
See:
http://www.open-mpi.org/faq/?category=tuning
Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985
On Dec 5, 2008, at 12:22 PM, Justin wrote:
Hi,
We are currently using Open MPI 1.3 on Ranger for large processor jobs
(8K+). Our code appears to be occasionally deadlocking at random
within point-to-point communication (see stack trace below). This
code has been tested on many different MPI versions and as far as we
know it does not contain a deadlock. However, in the past we have
run into problems with shared memory optimizations within MPI causing
deadlocks. We can usually avoid these by setting a few environment
variables to either increase the size of shared memory buffers or
disable shared memory optimizations altogether. Does Open MPI have
any known deadlocks that might be causing ours? If so, are there any
workarounds? Also, how do we disable shared memory within Open MPI?
Here is an example of where processors are hanging:
#0 0x00002b2df3522683 in mca_btl_sm_component_progress () from
/opt/apps/intel10_1/openmpi/1.3/lib/openmpi/mca_btl_sm.so
#1 0x00002b2df2cb46bf in mca_bml_r2_progress () from
/opt/apps/intel10_1/openmpi/1.3/lib/openmpi/mca_bml_r2.so
#2 0x00002b2df0032ea4 in opal_progress () from
/opt/apps/intel10_1/openmpi/1.3/lib/libopen-pal.so.0
#3 0x00002b2ded0d7622 in ompi_request_default_wait_some () from
/opt/apps/intel10_1/openmpi/1.3//lib/libmpi.so.0
#4 0x00002b2ded109e34 in PMPI_Waitsome () from
/opt/apps/intel10_1/openmpi/1.3//lib/libmpi.so.0
Thanks,
Justin
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users