Fair enough.  Thanks anyway!

On Nov 30, 2011, at 3:39 PM, Tom Rosmond wrote:

> Jeff,
> 
> I'm afraid trying to produce a reproducer of this problem wouldn't be
> worth the effort.  It is a legacy code that I wasn't involved in
> developing and will soon be discarded, so I can't justify spending time
> trying to understand its behavior better.  The bottom line is that it
> works correctly with the small 'sync' value, and because it isn't very
> expensive to run, that is enough for us.
> 
> T. Rosmond
> 
> 
> On Wed, 2011-11-30 at 15:29 -0500, Jeff Squyres wrote:
>> Yes, but I'd like to see a reproducer that requires setting the 
>> sync_barrier_before=5.  Your reproducers allowed much higher values, IIRC.
>> 
>> I'm curious to know what makes that code require such a low value (i.e., 
>> 5)...
>> 
>> 
>> On Nov 30, 2011, at 1:50 PM, Ralph Castain wrote:
>> 
>>> FWIW: we already have a reproducer from prior work I did chasing this down 
>>> a couple of years ago. See orte/test/mpi/bcast_loop.c
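>>> 
>>> The pattern it exercises is essentially a tight loop of broadcasts. A 
>>> minimal sketch of that pattern (illustrative only, not the actual file 
>>> contents; the loop count and message size here are arbitrary):
>>> 
>>>   #include <mpi.h>
>>>   #include <stdio.h>
>>> 
>>>   int main(int argc, char **argv)
>>>   {
>>>       int rank, i;
>>>       int buf[1024];
>>> 
>>>       MPI_Init(&argc, &argv);
>>>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>> 
>>>       /* Many back-to-back broadcasts with no intervening
>>>        * synchronization let a slow rank fall further and
>>>        * further behind. */
>>>       for (i = 0; i < 1000000; ++i) {
>>>           if (0 == rank) buf[0] = i;
>>>           MPI_Bcast(buf, 1024, MPI_INT, 0, MPI_COMM_WORLD);
>>>       }
>>> 
>>>       if (0 == rank) printf("done\n");
>>>       MPI_Finalize();
>>>       return 0;
>>>   }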
>>> 
>>> 
>>> On Nov 29, 2011, at 9:35 AM, Jeff Squyres wrote:
>>> 
>>>> That's quite weird/surprising that you would need to set it down to *5* -- 
>>>> that's really low.
>>>> 
>>>> Can you share a simple reproducer code, perchance?
>>>> 
>>>> 
>>>> On Nov 15, 2011, at 11:49 AM, Tom Rosmond wrote:
>>>> 
>>>>> Ralph,
>>>>> 
>>>>> Thanks for the advice.  I have to set 'coll_sync_barrier_before=5' to do
>>>>> the job.  This is a big change from the default value (1000), so our
>>>>> application seems to be a pretty extreme case.
>>>>> 
>>>>> T. Rosmond
>>>>> 
>>>>> 
>>>>> On Mon, 2011-11-14 at 16:17 -0700, Ralph Castain wrote:
>>>>>> Yes, this is well documented - may be on the FAQ, but certainly has been 
>>>>>> in the user list multiple times.
>>>>>> 
>>>>>> The problem is that one process falls behind, which causes it to begin 
>>>>>> accumulating "unexpected messages" in its queue. This causes the 
>>>>>> matching logic to run a little slower, thus making the process fall 
>>>>>> further and further behind. Eventually, things hang because everyone is 
>>>>>> sitting in bcast waiting for the slow proc to catch up, but its queue 
>>>>>> is saturated and it can't.
>>>>>> 
>>>>>> The solution is to do exactly what you describe - add some barriers to 
>>>>>> force the slow process to catch up. This happened enough that we even 
>>>>>> added support for it in OMPI itself so you don't have to modify your 
>>>>>> code. Look at the following from "ompi_info --param coll sync"
>>>>>> 
>>>>>>             MCA coll: parameter "coll_base_verbose" (current value: <0>, data source: default value)
>>>>>>                       Verbosity level for the coll framework (0 = no verbosity)
>>>>>>             MCA coll: parameter "coll_sync_priority" (current value: <50>, data source: default value)
>>>>>>                       Priority of the sync coll component; only relevant if barrier_before or barrier_after is > 0
>>>>>>             MCA coll: parameter "coll_sync_barrier_before" (current value: <1000>, data source: default value)
>>>>>>                       Do a synchronization before each Nth collective
>>>>>>             MCA coll: parameter "coll_sync_barrier_after" (current value: <0>, data source: default value)
>>>>>>                       Do a synchronization after each Nth collective
>>>>>> 
>>>>>> Take your pick - inserting a barrier before or after doesn't seem to 
>>>>>> make a lot of difference, but most people use "before". Try different 
>>>>>> values until you get something that works for you.
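>>>>>> 
>>>>>> For example (illustrative only; 100 is just an arbitrary starting 
>>>>>> value, and the rest of the command line is taken from your mail 
>>>>>> below), the parameter can be set from mpirun without touching the 
>>>>>> code:
>>>>>> 
>>>>>>   mpirun -mca coll_sync_barrier_before 100 -np 36 -bycore -bind-to-core program.exe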
>>>>>> 
>>>>>> 
>>>>>> On Nov 14, 2011, at 3:10 PM, Tom Rosmond wrote:
>>>>>> 
>>>>>>> Hello:
>>>>>>> 
>>>>>>> A colleague and I have been running a large F90 application that does an
>>>>>>> enormous number of mpi_bcast calls during execution.  I deny any
>>>>>>> responsibility for the design of the code and why it needs these calls,
>>>>>>> but it is what we have inherited and have to work with.
>>>>>>> 
>>>>>>> Recently we ported the code to an 8-node, 6-processor-per-node NUMA system
>>>>>>> (lstopo output attached) running Debian Linux 6.0.3 with Open MPI 1.5.3,
>>>>>>> and began having trouble with mysterious 'hangs' in the program inside
>>>>>>> the mpi_bcast calls.  The hangs were always in the same calls, but not
>>>>>>> necessarily at the same time during integration.  We originally didn't
>>>>>>> have NUMA support, so reinstalled with libnuma support added, but the
>>>>>>> problem persisted.  Finally, just as a wild guess, we inserted
>>>>>>> 'mpi_barrier' calls just before the 'mpi_bcast' calls, and the program
>>>>>>> now runs without problems.
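>>>>>>> 
>>>>>>> Schematically (a sketch only, written in the MPI C binding rather 
>>>>>>> than our F90; the buffer, count, root, and loop bound are 
>>>>>>> placeholders), the change amounts to:
>>>>>>> 
>>>>>>>   for (i = 0; i < nsteps; ++i) {
>>>>>>>       /* barrier added as a workaround: forces all ranks to catch
>>>>>>>        * up before the next broadcast */
>>>>>>>       MPI_Barrier(MPI_COMM_WORLD);
>>>>>>>       MPI_Bcast(buf, count, MPI_DOUBLE, root, MPI_COMM_WORLD);
>>>>>>>   }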
>>>>>>> 
>>>>>>> I believe conventional wisdom is that properly formulated MPI programs
>>>>>>> should run correctly without barriers, so do you have any thoughts on
>>>>>>> why we found it necessary to add them?  The code has run correctly on
>>>>>>> other architectures, e.g. the Cray XE6, so I don't think there is a bug
>>>>>>> anywhere.  My only explanation is that some internal resource gets
>>>>>>> exhausted because of the large number of 'mpi_bcast' calls in rapid
>>>>>>> succession, and the barrier calls force synchronization which allows the
>>>>>>> resource to be restored.  Does this make sense?  I'd appreciate any
>>>>>>> comments and advice you can provide.
>>>>>>> 
>>>>>>> 
>>>>>>> I have attached compressed copies of config.log and ompi_info for the
>>>>>>> system.  The program is built with ifort 12.0 and typically runs with 
>>>>>>> 
>>>>>>> mpirun -np 36 -bycore -bind-to-core program.exe
>>>>>>> 
>>>>>>> We have run both interactively and with PBS, but that doesn't seem to
>>>>>>> make any difference in program behavior.
>>>>>>> 
>>>>>>> T. Rosmond
>>>>>>> 
>>>>>>> 
>>>>>>> <lstopo_out.txt><config.log.bz2><ompi_info.bz2>
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

