Charles,
Having implemented some of the underlying collective algorithms, I am
puzzled by the need to force the sync interval to 1 to have things flowing.
I would definitely appreciate a reproducer so that I can identify (and
hopefully fix) the underlying problem.
Thanks,
George.
On Tue, Oct 29, 2019
Last time I did a reply on here, it created a new thread. Sorry about that
everyone. I just hit the Reply via email button. Hopefully this one will work.
To Gilles Gouaillardet:
My first thread has a reproducer that causes the problem.
To George Bosilca:
I had to set coll_sync_barrier_before=1 to get things working.
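For anyone who does not want to dig up the first thread: the reproducer is
essentially MPI_Bcast called back-to-back in a tight loop. A minimal sketch
of that shape (not my exact code; the buffer size and iteration count below
are placeholders) looks like this:

#include <mpi.h>

/* Tight-loop broadcast sketch: rank 0 broadcasts a small buffer many
 * times with no other communication in between.  The count and the
 * number of iterations are illustrative placeholders. */
int main(int argc, char **argv)
{
    int buf[1024] = {0};

    MPI_Init(&argc, &argv);
    for (int i = 0; i < 100000; i++) {
        MPI_Bcast(buf, 1024, MPI_INT, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}

On the new cluster a loop like this eventually stalls unless the sync
barrier is forced on, as described in my original post quoted below.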
Charles,
There is a known issue with calling collectives in a tight loop, due to the
lack of flow control at the network level. It results in a significant
slow-down that might appear as a deadlock to users. The workaround is to
enable the sync collective module, which will insert a fake barrier after a
tunable number of collective operations (see the coll_sync MCA parameters).
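On the command line that amounts to something like the sketch below (MCA
parameter names as found in recent Open MPI releases; please check
ompi_info on your build, and <nprocs> and ./bcast_loop are placeholders):

# Insert a barrier before every collective (interval of 1, as Charles
# reports above).  Depending on the release you may also need to raise
# coll_sync_priority so the sync wrapper component is actually selected.
mpirun --mca coll_sync_barrier_before 1 -np <nprocs> ./bcast_loop

A larger interval (a barrier every few hundred or thousand collectives)
normally amortizes the cost of the extra synchronization.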
Charles,
unless you expect yes or no answers, can you please post a simple program
that demonstrates the issue you are facing?
Cheers,
Gilles
On 10/29/2019 6:37 AM, Garrett, Charles via users wrote:
Does anyone have any idea why this is happening? Has anyone seen this
problem before?
I have a problem where MPI_Bcast hangs when it is called repeatedly in rapid succession.
This problem manifests itself on our new cluster, but not on our older one.
The new cluster has Cascade Lake processors. Each node contains 2 sockets with
18 cores per socket. Cluster size is 128 nodes with an ED