Re: [OMPI users] Deadlock on large numbers of processors

2009-01-12 Thread Jeff Squyres
Cross your fingers; we might release tomorrow (I've probably now jinxed it by saying that!). On Jan 12, 2009, at 1:54 PM, Justin wrote: In order for me to test this out I need to wait for TACC to install this version on Ranger. Right now they have version 1.3a1r19685 installed. I'm gues

Re: [OMPI users] Deadlock on large numbers of processors

2009-01-12 Thread Justin
In order for me to test this out I need to wait for TACC to install this version on Ranger. Right now they have version 1.3a1r19685 installed. I'm guessing this is probably an older version. I'm not sure when TACC will get around to updating their Open MPI version. I could request them to u

Re: [OMPI users] Deadlock on large numbers of processors

2009-01-12 Thread Jeff Squyres
Justin -- Could you actually give your code a whirl with 1.3rc3 to ensure that it fixes the problem for you? http://www.open-mpi.org/software/ompi/v1.3/ On Jan 12, 2009, at 1:30 PM, Tim Mattox wrote: Hi Justin, I applied the fixes for this particular deadlock to the 1.3 code base late

Re: [OMPI users] Deadlock on large numbers of processors

2009-01-12 Thread Tim Mattox
Hi Justin, I applied the fixes for this particular deadlock to the 1.3 code base late last week, see ticket #1725: https://svn.open-mpi.org/trac/ompi/ticket/1725 This should fix the described problem, but I personally have not tested to see if the deadlock in question is now gone. Everyone should

Re: [OMPI users] Deadlock on large numbers of processors

2009-01-12 Thread Justin
Hi, has this deadlock been fixed in the 1.3 source yet? Thanks, Justin Jeff Squyres wrote: On Dec 11, 2008, at 5:30 PM, Justin wrote: The more I look at this bug the more I'm convinced it is with Open MPI and not our code. Here is why: Our code generates a communication/execution schedul

Re: [OMPI users] Deadlock on large numbers of processors

2008-12-11 Thread Jeff Squyres
On Dec 11, 2008, at 5:30 PM, Justin wrote: The more I look at this bug the more I'm convinced it is with Open MPI and not our code. Here is why: Our code generates a communication/execution schedule. At each timestep this schedule is executed and all communication and execution is perform

Re: [OMPI users] Deadlock on large numbers of processors

2008-12-11 Thread Justin
The more I look at this bug the more I'm convinced it is with Open MPI and not our code. Here is why: Our code generates a communication/execution schedule. At each timestep this schedule is executed and all communication and execution is performed. Our problem is AMR which means the communi
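
A minimal sketch of what such a schedule-driven timestep can look like (illustrative only, not the application's actual code; the sched_entry_t layout and run_timestep name are invented for the example): every entry in the precomputed schedule is posted as a nonblocking send or receive, and the timestep then waits on all of them, so any unmatched entry surfaces as a hang in the final wait.

    /* Illustrative sketch only -- not the application code from this thread.
     * The schedule entry layout (peer rank, tag, buffer, count) is assumed. */
    #include <mpi.h>
    #include <stdlib.h>

    typedef struct { int peer, tag, count, is_send; double *buf; } sched_entry_t;

    void run_timestep(sched_entry_t *sched, int n, MPI_Comm comm)
    {
        MPI_Request *reqs = malloc(n * sizeof(MPI_Request));
        for (int i = 0; i < n; i++) {
            if (sched[i].is_send)   /* sends may be posted well before the matching receives */
                MPI_Isend(sched[i].buf, sched[i].count, MPI_DOUBLE,
                          sched[i].peer, sched[i].tag, comm, &reqs[i]);
            else
                MPI_Irecv(sched[i].buf, sched[i].count, MPI_DOUBLE,
                          sched[i].peer, sched[i].tag, comm, &reqs[i]);
        }
        /* Completes only if every posted send has a matching receive somewhere
         * in the global schedule; a missing pair shows up as a hang here. */
        MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);
        free(reqs);
    }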

Re: [OMPI users] Deadlock on large numbers of processors

2008-12-11 Thread Jeff Squyres
George -- Is this the same issue that you're working on? (we have a "blocker" bug for v1.3 about deadlock at heavy messaging volume -- on Tuesday, it looked like a bug in our freelist...) On Dec 9, 2008, at 10:28 AM, Justin wrote: I have tried disabling the shared memory by running with th

Re: [OMPI users] Deadlock on large numbers of processors

2008-12-09 Thread Justin
I have tried disabling the shared memory by running with the following parameters to mpirun: --mca btl openib,self --mca btl_openib_ib_timeout 23 --mca btl_openib_use_srq 1 --mca btl_openib_use_rd_max 2048 Unfortunately this did not get rid of any hangs and seems to have made them more common

Re: [OMPI users] Deadlock on large numbers of processors

2008-12-09 Thread Rolf Vandevaart
The current version of Open MPI installed on Ranger is 1.3a1r19685, which is from early October. This version has a fix for ticket #1378. Ticket #1449 is not an issue in this case because each node has 16 processors and #1449 is for larger SMPs. However, I am wondering if this is because of

Re: [OMPI users] Deadlock on large numbers of processors

2008-12-09 Thread Lenny Verkhovsky
also see https://svn.open-mpi.org/trac/ompi/ticket/1449 On 12/9/08, Lenny Verkhovsky wrote: > > maybe it's related to https://svn.open-mpi.org/trac/ompi/ticket/1378 ?? > > On 12/5/08, Justin wrote: >> >> The reason I'd like to disable these eager buffers is to help detect the >> deadlock bett

Re: [OMPI users] Deadlock on large numbers of processors

2008-12-09 Thread Lenny Verkhovsky
maybe it's related to https://svn.open-mpi.org/trac/ompi/ticket/1378 ?? On 12/5/08, Justin wrote: > > The reason I'd like to disable these eager buffers is to help detect the > deadlock better. I would not run with this for a normal run but it would be > useful for debugging. If the deadlock i

Re: [OMPI users] Deadlock on large numbers of processors

2008-12-05 Thread Justin
The reason I'd like to disable these eager buffers is to help detect the deadlock better. I would not run with this for a normal run but it would be useful for debugging. If the deadlock is indeed due to our code then disabling any shared buffers or eager sends would make that deadlock reprod

Re: [OMPI users] Deadlock on large numbers of processors

2008-12-05 Thread Brock Palen
Open MPI has different eager limits for all the network types; on your system run: ompi_info --param btl all and look for the eager_limits. You can set these values to 0 using the syntax I showed you before. That would disable eager messages. There might be a better way to disable eager messag
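
In concrete terms (command lines assumed from the advice above; exact parameter names can vary between Open MPI builds, and the ./app target is a placeholder), listing the eager limits and zeroing them to force every message through the rendezvous path for a debugging run might look like:

    # list the eager limit of every BTL
    ompi_info --param btl all | grep eager_limit

    # debugging run with eager sends disabled for the sm and openib BTLs
    mpirun --mca btl_sm_eager_limit 0 --mca btl_openib_eager_limit 0 ./app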

Re: [OMPI users] Deadlock on large numbers of processors

2008-12-05 Thread Justin
Thank you for this info. I should add that our code tends to post a lot of sends prior to the other side posting receives. This causes a lot of unexpected messages to exist. Our code explicitly matches up all tags and processors (that is, we do not use MPI wild cards). If we had a deadlock
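
The textbook illustration of why that pattern is sensitive to the eager limit (a generic example, not the code under discussion): two ranks that each post a blocking send before their receive complete only while the messages fit in the eager buffers; above the threshold both sends switch to rendezvous, block waiting for the matching receives, and the pair deadlocks.

    /* Generic example only -- not the application from this thread.
     * Run with exactly 2 ranks.  Below the eager limit the sends buffer and
     * return; above it both ranks block inside MPI_Send and the run hangs. */
    #include <mpi.h>
    #include <stdio.h>

    #define N (1 << 20)   /* large enough to exceed typical eager limits */

    static double sendbuf[N], recvbuf[N];

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int peer = 1 - rank;

        MPI_Send(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);  /* send posted first */
        MPI_Recv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("rank %d done\n", rank);
        MPI_Finalize();
        return 0;
    }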

Re: [OMPI users] Deadlock on large numbers of processors

2008-12-05 Thread Brock Palen
Whenever this happens we found the code to have a deadlock. Users never saw it until they crossed the eager->rendezvous threshold. Yes, you can disable shared memory with: mpirun --mca btl ^sm Or you can try increasing the eager limit. ompi_info --param btl sm MCA btl: parameter "btl_sm_eage
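
In command form (the btl_sm_eager_limit value and the ./app name below are placeholders), the two suggestions look like:

    # run without the shared-memory BTL
    mpirun --mca btl ^sm ./app

    # inspect the shared-memory eager limit, then raise it for a run
    ompi_info --param btl sm
    mpirun --mca btl_sm_eager_limit 65536 ./app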

Re: [OMPI users] Deadlock on large numbers of processors

2008-12-05 Thread Scott Atchley
On Dec 5, 2008, at 12:22 PM, Justin wrote: Does Open MPI have any known deadlocks that might be causing our deadlocks? Known deadlocks, no. We are assisting a customer, however, with a deadlock that occurs in IMB Alltoall (and some other IMB tests) when using 128 hosts and the MX BTL. We h

[OMPI users] Deadlock on large numbers of processors

2008-12-05 Thread Justin
Hi, We are currently using Open MPI 1.3 on Ranger for large processor jobs (8K+). Our code appears to be occasionally deadlocking at random within point-to-point communication (see stacktrace below). This code has been tested on many different MPI versions and as far as we know it does not c