Oh - and another one at orte/test/mpi/reduce-hang.c
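For readers without an Open MPI source tree handy, here is a rough sketch of the kind of tight collective loop such reproducers exercise; it is illustrative only and is not the actual contents of reduce-hang.c or bcast_loop.c.

    /* Illustrative only -- not the actual orte/test/mpi reproducers.
     * Many back-to-back broadcasts from one root: with no explicit
     * synchronization, a slow rank accumulates unexpected messages. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, i;
        double buf[1024];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (i = 0; i < 1000000; i++) {
            if (rank == 0 && (i % 100000) == 0)
                printf("iteration %d\n", i);
            MPI_Bcast(buf, 1024, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }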
On Nov 30, 2011, at 11:50 AM, Ralph Castain wrote:

> FWIW: we already have a reproducer from prior work I did chasing this down a couple of years ago. See orte/test/mpi/bcast_loop.c
>
> On Nov 29, 2011, at 9:35 AM, Jeff Squyres wrote:
>
>> That's quite weird/surprising that you would need to set it down to *5* -- that's really low.
>>
>> Can you share a simple reproducer code, perchance?
>>
>> On Nov 15, 2011, at 11:49 AM, Tom Rosmond wrote:
>>
>>> Ralph,
>>>
>>> Thanks for the advice. I have to set 'coll_sync_barrier_before=5' to do the job. This is a big change from the default value (1000), so our application seems to be a pretty extreme case.
>>>
>>> T. Rosmond
>>>
>>> On Mon, 2011-11-14 at 16:17 -0700, Ralph Castain wrote:
>>>> Yes, this is well documented - it may be in the FAQ, and it has certainly come up on the user list multiple times.
>>>>
>>>> The problem is that one process falls behind, which causes it to begin accumulating "unexpected messages" in its queue. This causes the matching logic to run a little slower, thus making the process fall further and further behind. Eventually, things hang because everyone is sitting in bcast waiting for the slow proc to catch up, but its queue is saturated and it can't.
>>>>
>>>> The solution is to do exactly what you describe - add some barriers to force the slow process to catch up. This happened often enough that we added support for it in OMPI itself so you don't have to modify your code. Look at the following from "ompi_info --param coll sync":
>>>>
>>>> MCA coll: parameter "coll_base_verbose" (current value: <0>, data source: default value)
>>>>           Verbosity level for the coll framework (0 = no verbosity)
>>>> MCA coll: parameter "coll_sync_priority" (current value: <50>, data source: default value)
>>>>           Priority of the sync coll component; only relevant if barrier_before or barrier_after is > 0
>>>> MCA coll: parameter "coll_sync_barrier_before" (current value: <1000>, data source: default value)
>>>>           Do a synchronization before each Nth collective
>>>> MCA coll: parameter "coll_sync_barrier_after" (current value: <0>, data source: default value)
>>>>           Do a synchronization after each Nth collective
>>>>
>>>> Take your pick - inserting a barrier before or after doesn't seem to make a lot of difference, but most people use "before". Try different values until you get something that works for you.
>>>>
>>>> On Nov 14, 2011, at 3:10 PM, Tom Rosmond wrote:
>>>>
>>>>> Hello:
>>>>>
>>>>> A colleague and I have been running a large F90 application that makes an enormous number of mpi_bcast calls during execution. I deny any responsibility for the design of the code and why it needs these calls, but it is what we have inherited and have to work with.
>>>>>
>>>>> Recently we ported the code to an 8-node, 6-processor/node NUMA system (lstopo output attached) running Debian Linux 6.0.3 with Open MPI 1.5.3, and began having trouble with mysterious 'hangs' in the program inside the mpi_bcast calls. The hangs were always in the same calls, but not necessarily at the same time during integration. We originally didn't have NUMA support, so we reinstalled with libnuma support added, but the problem persisted. Finally, just as a wild guess, we inserted 'mpi_barrier' calls just before the 'mpi_bcast' calls, and the program now runs without problems.
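In code, the workaround described above amounts to something like the following sketch (shown in C for brevity, although the application itself is Fortran 90; the synchronization period is an arbitrary illustrative value, not one taken from the thread):

    /* Sketch of the manual workaround: throttle every Nth broadcast with a
     * barrier so no rank can run arbitrarily far ahead of the others. */
    #include <mpi.h>

    #define SYNC_EVERY 100          /* illustrative period, tune as needed */

    static int bcast_count = 0;

    void throttled_bcast(void *buf, int count, MPI_Datatype type,
                         int root, MPI_Comm comm)
    {
        if (++bcast_count % SYNC_EVERY == 0)
            MPI_Barrier(comm);      /* force stragglers to catch up */
        MPI_Bcast(buf, count, type, root, comm);
    }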
>>>>> I believe conventional wisdom is that properly formulated MPI programs should run correctly without barriers, so do you have any thoughts on why we found it necessary to add them? The code has run correctly on other architectures, e.g. Cray XE6, so I don't think there is a bug anywhere. My only explanation is that some internal resource gets exhausted because of the large number of 'mpi_bcast' calls in rapid succession, and the barrier calls force synchronization, which allows the resource to be restored. Does this make sense? I'd appreciate any comments and advice you can provide.
>>>>>
>>>>> I have attached compressed copies of config.log and ompi_info for the system. The program is built with ifort 12.0 and typically runs with
>>>>>
>>>>> mpirun -np 36 -bycore -bind-to-core program.exe
>>>>>
>>>>> We have run both interactively and with PBS, but that doesn't seem to make any difference in program behavior.
>>>>>
>>>>> T. Rosmond
>>>>>
>>>>> <lstopo_out.txt><config.log.bz2><ompi_info.bz2>
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
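For reference, Ralph's MCA-parameter route combined with Tom's launch line would look something like the following, using the value 5 that Tom reported needing; any equivalent way of setting the parameter (an MCA parameter file, or the OMPI_MCA_coll_sync_barrier_before environment variable) has the same effect.

    mpirun --mca coll_sync_barrier_before 5 -np 36 -bycore -bind-to-core program.exe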