Also see https://svn.open-mpi.org/trac/ompi/ticket/1449
On 12/9/08, *Lenny Verkhovsky* <lenny.verkhov...@gmail.com> wrote:
Maybe it's related to https://svn.open-mpi.org/trac/ompi/ticket/1378?
On 12/5/08, *Justin* <luitj...@cs.utah.edu> wrote:
The reason I'd like to disable these eager buffers is to help detect the deadlock better. I would not run with this for a normal run, but it would be useful for debugging. If the deadlock is indeed due to our code, then disabling any shared buffers or eager sends would make that deadlock reproducible. In addition, we might be able to lower the number of processors. Right now, determining which processor deadlocks when we are using 8K cores and each processor has hundreds of messages sent out would be quite difficult.

Thanks for your suggestions,
Justin
Brock Palen wrote:
OpenMPI has different eager limits for all the network types; on your system run:

ompi_info --param btl all

and look for the eager limits. You can set these values to 0 using the syntax I showed you before; that would disable eager messages. There might be a better way to disable them. Not sure why you would want to disable them, though, since they are there for performance.
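For example (a hypothetical command line; check the exact parameter names against your ompi_info output, and ./app stands in for your executable):

mpirun --mca btl_sm_eager_limit 0 \
       --mca btl_openib_eager_limit 0 \
       --mca btl_self_eager_limit 0 \
       -np 8192 ./app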
Maybe you would still see a deadlock if every message was below the threshold. I think there is a limit on the number of eager messages a receiving CPU will accept, but I'm not sure about that, and I kind of doubt it. Try tweaking your buffer sizes: make the openib btl eager limit the same as shared memory's and see if you get lockups between hosts and not just over shared memory.
Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985
On Dec 5, 2008, at 2:10 PM, Justin wrote:
Thank you for this info. I should add that our code tends to post a lot of sends prior to the other side posting receives, which causes a lot of unexpected messages to exist. Our code explicitly matches up all tags and processors (that is, we do not use MPI wildcards). If we had a deadlock, I would think we would see it regardless of whether or not we cross the rendezvous threshold. I guess one way to test this would be to set this threshold to 0; if it then deadlocks, we would likely be able to track down the deadlock. Are there any other parameters we can pass to MPI that will turn off buffering?
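Roughly, the pattern looks something like this (a simplified sketch; the real buffer sizes, tags, and message counts are different):

/* Simplified sketch of the pattern described above: every rank posts all of
 * its nonblocking sends before any matching receive goes up, then waits on
 * everything.  Sizes and tags here are illustrative only. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int msg_len = 8192;   /* 64 KB per message: above a 4 KB eager limit */
    double *sendbuf = malloc((size_t)size * msg_len * sizeof(double));
    double *recvbuf = malloc((size_t)size * msg_len * sizeof(double));
    MPI_Request *reqs = malloc(2 * size * sizeof(MPI_Request));

    /* All sends are posted first; the matching receives only appear later,
     * so the messages sit as unexpected (eager) or stall at the rendezvous
     * handshake until the receives are posted. */
    for (i = 0; i < size; i++)
        MPI_Isend(sendbuf + (size_t)i * msg_len, msg_len, MPI_DOUBLE,
                  i, rank /* tag = sender rank */, MPI_COMM_WORLD, &reqs[i]);
    for (i = 0; i < size; i++)
        MPI_Irecv(recvbuf + (size_t)i * msg_len, msg_len, MPI_DOUBLE,
                  i, i /* tag = source rank */, MPI_COMM_WORLD, &reqs[size + i]);

    MPI_Waitall(2 * size, reqs, MPI_STATUSES_IGNORE);

    free(sendbuf); free(recvbuf); free(reqs);
    MPI_Finalize();
    return 0;
}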
Thanks,
Justin
Brock Palen wrote:
Whenever this happens we found the code to have a deadlock; users never saw it until they crossed the eager->rendezvous threshold.

Yes, you can disable shared memory with:

mpirun --mca btl ^sm

Or you can try increasing the eager limit:

ompi_info --param btl sm
MCA btl: parameter "btl_sm_eager_limit" (current value: "4096")

You can modify this limit at run time; I think (can't test it right now) it is just:

mpirun --mca btl_sm_eager_limit 40960

When tweaking these values, I think you can also use environment variables in place of putting it all on the mpirun line:

export OMPI_MCA_btl_sm_eager_limit=40960

See:
http://www.open-mpi.org/faq/?category=tuning
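For example, I believe the same environment-variable form also works for the btl selection itself (untested here; ./app is a placeholder for your executable):

export OMPI_MCA_btl=^sm
mpirun -np 8192 ./app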
Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985
On Dec 5, 2008, at 12:22 PM, Justin wrote:
Hi,
We are currently using OpenMPI 1.3 on Ranger for large processor jobs (8K+). Our code appears to occasionally deadlock at random within point-to-point communication (see the stack trace below). This code has been tested on many different MPI versions and, as far as we know, it does not contain a deadlock. However, in the past we have run into problems with shared memory optimizations within MPI causing deadlocks. We can usually avoid these by setting a few environment variables to either increase the size of the shared memory buffers or disable the shared memory optimizations altogether. Does OpenMPI have any known deadlocks that might be causing ours? If so, are there any workarounds? Also, how do we disable shared memory within OpenMPI?
Here is an example of where processors are hanging:
#0  0x00002b2df3522683 in mca_btl_sm_component_progress () from /opt/apps/intel10_1/openmpi/1.3/lib/openmpi/mca_btl_sm.so
#1  0x00002b2df2cb46bf in mca_bml_r2_progress () from /opt/apps/intel10_1/openmpi/1.3/lib/openmpi/mca_bml_r2.so
#2  0x00002b2df0032ea4 in opal_progress () from /opt/apps/intel10_1/openmpi/1.3/lib/libopen-pal.so.0
#3  0x00002b2ded0d7622 in ompi_request_default_wait_some () from /opt/apps/intel10_1/openmpi/1.3//lib/libmpi.so.0
#4  0x00002b2ded109e34 in PMPI_Waitsome () from /opt/apps/intel10_1/openmpi/1.3//lib/libmpi.so.0
Thanks,
Justin
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users