The reason I'd like to disable these eager buffers is to help detect the
deadlock better. I would not run with this for a normal run, but it
would be useful for debugging. If the deadlock is indeed due to our
code, then disabling any shared buffers or eager sends would make that
deadlock reproducible. In addition, we might be able to lower the
number of processors. Right now, determining which processor is
deadlocked when we are using 8K cores and each processor has hundreds of
messages sent out would be quite difficult.
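For what it's worth, something like the following is what I have in mind
for a debug run; the BTL parameter names, the processor count, and the
application name are my own placeholders, and I'd confirm the exact
parameter names with "ompi_info --param btl all" first:

# debug run only: zero the eager limits so every message takes the
# rendezvous path, and shrink the job to something tractable
mpirun -np 512 \
    --mca btl_sm_eager_limit 0 \
    --mca btl_openib_eager_limit 0 \
    ./our_app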
Thanks for your suggestions,
Justin
Brock Palen wrote:
OpenMPI has different eager limits for all the network types; on your
system run:
ompi_info --param btl all
and look for the eager_limits
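For example, a plain grep (nothing Open MPI specific) will pull out just
the eager limits:

ompi_info --param btl all | grep -i eager_limit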
You can set these values to 0 using the syntax I showed you before.
That would disable eager messages.
There might be a better way to disable eager messages.
Not sure why you would want to disable them; they are there for
performance.
Maybe you would still see a deadlock even if every message were below
the threshold: I think there is a limit on the number of eager messages
a receiving CPU will accept. Not sure about that, though, and I kind of
doubt it.
Try tweaking your buffer sizes: make the openib btl eager limit the
same as the shared memory one, and see if you get lockups between hosts
and not just over shared memory.
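Something along these lines; the 4096 comes from the btl_sm_eager_limit
shown below, and the openib parameter name, processor count, and app
name are placeholders you should check for your setup:

mpirun --mca btl_openib_eager_limit 4096 -np 8192 ./your_app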
Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985
On Dec 5, 2008, at 2:10 PM, Justin wrote:
Thank you for this info. I should add that our code tends to post a
lot of sends prior to the other side posting receives. This causes a
lot of unexpected messages to exist. Our code explicitly matches up
all tags and processors (that is, we do not use MPI wildcards). If
we had a deadlock, I would think we would see it regardless of
whether or not we cross the rendezvous threshold. I guess one way to
test this would be to set this threshold to 0. If it then deadlocks,
we would likely be able to track down the deadlock. Are there
any other parameters we can pass MPI that will turn off buffering?
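For the record, the test I have in mind would be something like the
following, using the environment-variable form mentioned below and
assuming a value of 0 is accepted (the application name is just a
placeholder):

export OMPI_MCA_btl_sm_eager_limit=0
mpirun -np 8192 ./our_app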
Thanks,
Justin
Brock Palen wrote:
Whenever this has happened, we found the code to have a deadlock; users
never saw it until they crossed the eager->rendezvous threshold.
Yes you can disable shared memory with:
mpirun --mca btl ^sm
Or you can try increasing the eager limit.
ompi_info --param btl sm
MCA btl: parameter "btl_sm_eager_limit" (current value: "4096")
You can modify this limit at run time. I think (can't test it right
now) it is just:
mpirun --mca btl_sm_eager_limit 40960
I think, when tweaking these values, you can also use environment
variables in place of putting it all on the mpirun line:
export OMPI_MCA_btl_sm_eager_limit=40960
See:
http://www.open-mpi.org/faq/?category=tuning
Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985
On Dec 5, 2008, at 12:22 PM, Justin wrote:
Hi,
We are currently using OpenMPI 1.3 on Ranger for large processor
jobs (8K+). Our code appears to be occasionally deadlocking at
random within point-to-point communication (see stack trace below).
This code has been tested on many different MPI versions, and as far
as we know it does not contain a deadlock. However, in the past we
have run into problems with shared memory optimizations within MPI
causing deadlocks. We can usually avoid these by setting a few
environment variables to either increase the size of shared memory
buffers or disable shared memory optimizations altogether. Does
OpenMPI have any known deadlocks that might be causing ours? If so,
are there any workarounds? Also, how do we disable shared memory
within OpenMPI?
Here is an example of where processors are hanging:
#0 0x00002b2df3522683 in mca_btl_sm_component_progress () from
/opt/apps/intel10_1/openmpi/1.3/lib/openmpi/mca_btl_sm.so
#1 0x00002b2df2cb46bf in mca_bml_r2_progress () from
/opt/apps/intel10_1/openmpi/1.3/lib/openmpi/mca_bml_r2.so
#2 0x00002b2df0032ea4 in opal_progress () from
/opt/apps/intel10_1/openmpi/1.3/lib/libopen-pal.so.0
#3 0x00002b2ded0d7622 in ompi_request_default_wait_some () from
/opt/apps/intel10_1/openmpi/1.3//lib/libmpi.so.0
#4 0x00002b2ded109e34 in PMPI_Waitsome () from
/opt/apps/intel10_1/openmpi/1.3//lib/libmpi.so.0
Thanks,
Justin