Charles,

There is a known issue with calling collectives in a tight loop, caused by the lack of flow control at the network level. It results in a significant slowdown that can look like a deadlock to users. The workaround is to enable the sync collective module, which inserts a fake barrier at regular intervals in the tight collective loop, allowing a more streamlined usage of the network.
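As a rough sketch of what enabling it can look like, the snippet below sets one of the sync module's parameters through Open MPI's `OMPI_MCA_` environment-variable prefix. The value 50 and the `./my_app` binary are assumptions for illustration only, not tested recommendations.

```sh
# Assumed value: ask the sync module to insert a barrier before every
# 50th collective operation. 50 is illustrative; tune it for your code.
export OMPI_MCA_coll_sync_barrier_before=50

# Print the setting so it can be checked before launching.
echo "coll_sync_barrier_before=${OMPI_MCA_coll_sync_barrier_before}"

# Then launch as usual (./my_app is a hypothetical binary):
# mpirun -np 4 ./my_app
```

The same parameter can instead be passed on the command line, for example `mpirun --mca coll_sync_barrier_before 50 ...`, which avoids leaving it set in the shell environment.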
Run `ompi_info --param coll sync -l 9` to see the parameters you can tune. I think setting either coll_sync_barrier_before or coll_sync_barrier_after to anything larger than a few tens should be good enough.

George.

On Mon, Oct 28, 2019 at 9:29 PM Gilles Gouaillardet via users <users@lists.open-mpi.org> wrote:

> Charles,
>
> unless you expect yes or no answers, can you please post a simple
> program that evidences the issue you are facing?
>
> Cheers,
>
> Gilles
>
> On 10/29/2019 6:37 AM, Garrett, Charles via users wrote:
> >
> > Does anyone have any idea why this is happening? Has anyone seen this
> > problem before?