On 10/20/2015 11:51 PM, Alexander Monakov wrote:
> On Tue, 20 Oct 2015, Bernd Schmidt wrote:
>> My experience has been that there is practically no way of using bar.sync
>> reliably, since we can't control warp divergence and reconvergence at the
>> ptx level but the hardware bar.sync instruction only works when executed by
>> all threads in a warp at the same time.
>
> I don't think it's that bad. Divergence and reconvergence are implicit: a
> non-uniform branch is a divergence point, and the corresponding reconvergence
> point is at its immediate post-dominator.

That's good in theory, but I have seen cases where very odd things
seemed to be happening in ptxas. Another problem is that GCC is
quite unconcerned about maintaining such reconvergence points in its
optimization passes.
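
To make the reconvergence rule quoted above concrete, here is a minimal
sketch in Python that computes the immediate post-dominator of a branch
in a toy control-flow graph. The graph, the node names and both helper
functions are made up for illustration; this is not how ptxas or GCC
represent control flow, and it assumes every node can reach the exit.

```python
def post_dominators(succ, exit_node):
    """Return {node: set of nodes that post-dominate it} (iterative dataflow).

    succ maps each node to its list of successors; only exit_node may have none.
    """
    nodes = set(succ)
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            # n is post-dominated by itself plus whatever post-dominates
            # every one of its successors.
            new = {n} | set.intersection(*(pdom[s] for s in succ[n]))
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom


def immediate_post_dominator(succ, exit_node, node):
    """Return the closest strict post-dominator of node, or None."""
    pdom = post_dominators(succ, exit_node)
    strict = pdom[node] - {node}
    for candidate in strict:
        # The immediate post-dominator is itself post-dominated by all
        # the other strict post-dominators of node.
        if strict <= pdom[candidate]:
            return candidate
    return None


# Hypothetical if/else diamond: "branch" is the non-uniform branch.
cfg = {
    "entry": ["branch"],
    "branch": ["then", "else"],
    "then": ["join"],
    "else": ["join"],
    "join": ["exit"],
    "exit": [],
}

reconvergence_point = immediate_post_dominator(cfg, "exit", "branch")
print(reconvergence_point)
```

In this diamond the expected reconvergence point is "join". If an
optimization pass duplicates or moves the code at "join", the
post-dominator of the branch changes with it, which is the concern
about reconvergence points not being maintained.
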

> For bar.sync behavior the documentation gives an explicit guarantee: every
> time a warp encounters a bar.sync instruction, it bumps the count by the warp
> width (32), irrespective of how many warp members are active at the time of
> encounter.

Yeah, but that's undesirable: you can breeze right past a bar.sync
before the thing you wanted to synchronize has completed.
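
As a rough illustration of the problem, here is a toy Python model of
the counting rule described above. It is only a sketch: the function
name, the arrival list and the release condition are assumptions made
for this example, not a description of what the hardware actually does.

```python
WARP_SIZE = 32


def simulate_barrier(expected_threads, arrivals):
    """Toy model of the bar.sync arrival count.

    arrivals lists, in order, the number of active lanes each time some
    warp executes bar.sync. Returns (step at which the barrier releases,
    number of threads that really reached it by then). The step is None
    if the barrier never releases.
    """
    count = 0
    really_arrived = 0
    for step, active_lanes in enumerate(arrivals):
        # The count goes up by the full warp width, no matter how many
        # lanes were active at this encounter.
        count += WARP_SIZE
        really_arrived += active_lanes
        if count >= expected_threads:
            return step, really_arrived
    return None, really_arrived


# Two warps, 64 threads expected. Warp 0 has diverged, so its two halves
# hit bar.sync separately; warp 1 has not reached the barrier at all.
print(simulate_barrier(64, [16, 16]))

# Same barrier with both warps converged.
print(simulate_barrier(64, [32, 32]))
```

In the diverged case the model releases the barrier after the second
arrival even though only 32 of the 64 threads got there, while the
converged case releases with all 64 present.
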
Bernd