On Tue, 20 Oct 2015, Bernd Schmidt wrote:
> On 10/20/2015 08:34 PM, Alexander Monakov wrote:
> > On NVPTX, there's 16 hardware barriers for each thread team, each barrier
> > has a variable waiter count.  The instruction 'bar.sync N, M;' allows to
> > wait on barrier number N until M threads have arrived.  M should be
> > pre-multiplied by warp width.  It's also possible to 'post' the barrier
> > without suspending with 'bar.arrive'.
> >
> > We should be able to provide gomp barrier via a combination of ptx barriers
> > and atomics.  This patch is a first step in that direction.
> >
> > It's mostly a copy of Linux implementation, and it's very likely that
> > functions more complex than gomp_barrier_wait_end are implemented
> > incorrectly.
> > I will have to review all of that (and optimize, hopefully).
> >
> > I'm not sure if naked asm()'s are OK.  It's possible to implement a builtin
> > instead for a minor beautification.  Thoughts?
>
> I have no concerns about naked asms.  I'm more concerned about whether this
> actually works - how much testing has this had?

It does survive the libgomp C/C++ tests, which at least exercise the simplest
barrier, gomp_barrier_wait_end.

> My experience has been that there is practically no way of using bar.sync
> reliably, since we can't control warp divergence and reconvergence at the
> ptx level but the hardware bar.sync instruction only works when executed by
> all threads in a warp at the same time.

I don't think it's that bad.  Divergence and reconvergence are implicit: a
non-uniform branch is a divergence point, and the corresponding reconvergence
point is at its immediate post-dominator.  I do miss a way to force
reconvergence at a given point, though, "resurrecting" masked-out warp members.

For bar.sync behavior the documentation gives an explicit guarantee: every time
a warp encounters a bar.sync instruction, it bumps the count by the warp width
(32), irrespective of how many warp members are active at the time of
encounter.

Alexander
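
To make the counting rule above concrete, here is a small host-side model in
plain C.  It is only a sketch for illustration: it is not PTX and not code from
the patch, and the names bar_count_operand, toy_barrier and toy_warp_arrive are
made up for this example.  It assumes the M operand is the team size rounded up
to a whole number of warps, which is one reading of "pre-multiplied by warp
width".

```c
#include <assert.h>

#define WARP_WIDTH 32  /* NVPTX warp width, as stated in the mail */

/* Hypothetical helper: the count operand M for 'bar.sync N, M;', taken
   here to be the team size rounded up to a whole number of warps.  */
static unsigned
bar_count_operand (unsigned nthreads)
{
  unsigned nwarps = (nthreads + WARP_WIDTH - 1) / WARP_WIDTH;
  return nwarps * WARP_WIDTH;
}

/* Toy model of a single hardware barrier.  */
struct toy_barrier
{
  unsigned arrived;   /* arrival count accumulated so far */
  unsigned expected;  /* the M operand */
};

/* Model one warp reaching bar.sync.  The count grows by WARP_WIDTH no
   matter how many lanes are active; active_lanes is only sanity-checked.
   Returns 1 if this arrival completes the barrier, 0 otherwise.  */
static int
toy_warp_arrive (struct toy_barrier *b, unsigned active_lanes)
{
  assert (active_lanes >= 1 && active_lanes <= WARP_WIDTH);
  b->arrived += WARP_WIDTH;
  if (b->arrived >= b->expected)
    {
      b->arrived = 0;  /* barrier resets for reuse */
      return 1;
    }
  return 0;
}
```

In this model a 40-thread team gets M = 64, and the barrier completes once both
warps have arrived, even though the second warp has only 8 active lanes.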