Hi Tom!

This is me again, following along GCC/nvptx development, and asking questions.  ;-)

On 2022-02-19T20:07:18+0100, Tom de Vries via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> With the default ptx isa 6.0, we have for uniform-simt-1.c:
> ...
>         @%r33   atom.global.cas.b32     %r26, [a], %r28, %r29;
>                 shfl.sync.idx.b32       %r26, %r26, %r32, 31, 0xffffffff;
> ...
>
> The atomic insn is predicated by -muniform-simt, and the subsequent insn does
> a warp sync, at which point the warp is uniform again.

I understand the concern here is Independent Thread Scheduling, where the
execution of predicated-off threads of a warp ('@ ! %r33') may proceed with
the next instruction, 'shfl', without implicitly waiting for the other
threads of a warp still working on the 'atom'?  Hence, the 'sync' aspect
of 'shfl.sync', as a means that PTX provides at the ISA level such that
we're getting the desired semantics: as its first step, "wait for all
threads in membermask to arrive".

> But with -mptx=3.1, we have instead:
> ...
>         @%r33   atom.global.cas.b32     %r26, [a], %r28, %r29;
>                 shfl.idx.b32            %r26, %r26, %r32, 31;
> ...
>
> The shfl does not sync the warp, and we want the warp to go back to executing
> uniformly asap.  We cannot enforce this

Is it really the case that such code may cause "permanent" warp-divergent
execution (until re-converging "somewhere")?  My understanding has been
that predicated-off threads of a warp ('@ ! %r33') would simply idle,
implicitly waiting for the other threads of a warp still working on the
'atom' -- due to the nature of a shared program counter per warp, and the
desire to re-converge as soon as possible.

For example, PTX ISA 7.2, 3.1. "A Set of SIMT Multiprocessors":

| [...]
| At every instruction issue time, the SIMT unit selects a warp that is ready to execute and
| issues the next instruction to the active threads of the warp. A warp executes one common
| instruction at a time, so full efficiency is realized when all threads of a warp agree on their
| execution path.
| If threads of a warp diverge via a data-dependent conditional branch, the
| warp serially executes each branch path taken, disabling threads that are not on that path,
| and when all paths complete, the threads converge back to the same execution path. [...]

So I'd have assumed that after the potentially-diverging '@%r33'-predicated
'atom' instruction, we're implicitly re-converging for the unpredicated
'shfl' (as long as Independent Thread Scheduling isn't involved, which it
isn't for '-mptx=3.1')?

If I read you correctly, my understanding is wrong, and we may thus be
getting "permanent" warp-divergent execution as soon as there's any
predication/conditional involved that may evaluate differently for
individual threads of a warp, and we thus need such *explicit*
synchronization after all such instances?

> but at least check this using
> nvptx_uniform_warp_check, similar to how that is done for openacc.
>
> Likewise, detect the case that no shfl insn is emitted, and add a
> nvptx_uniform_warp_check or nvptx_warpsync.

For example, 'nvptx-none/mgomp/libatomic/cas_1_.o':

    [...]
     @ %r71 atom.cas.b64 %r62,[%r35],%r29,%r61;
    +{
    +.reg .b32 act;
    +vote.ballot.b32 act,1;
    +.reg .pred uni;
    +setp.eq.b32 uni,act,0xffffffff;
    +@ ! uni trap;
    +@ ! uni exit;
    +}
     mov.b64 {%r69,%r70},%r62;
     shfl.idx.b32 %r69,%r69,%r68,31;
     shfl.idx.b32 %r70,%r70,%r68,31;
    [...]

So that's basically an 'assert' that all threads of a warp are converged.
(Is the JIT maybe even able to optimize that out?)  I guess I just wonder
if that's not satisfied implicitly.


Regards
 Thomas


> [nvptx] Use nvptx_warpsync / nvptx_uniform_warp_check for -muniform-simt
>
> gcc/ChangeLog:
>
> 2022-02-19  Tom de Vries  <tdevr...@suse.de>
>
>       * config/nvptx/nvptx.cc (nvptx_unisimt_handle_set): Change return
>       type to bool.
>       (nvptx_reorg_uniform_simt): Insert nvptx_uniform_warp_check or
>       nvptx_warpsync, if necessary.
>
> gcc/testsuite/ChangeLog:
>
> 2022-02-19  Tom de Vries  <tdevr...@suse.de>
>
>       * gcc.target/nvptx/uniform-simt-1.c: Add scan-assembler test.
>       * gcc.target/nvptx/uniform-simt-2.c: New test.
>
> ---
>  gcc/config/nvptx/nvptx.cc                       | 34 ++++++++++++++++++++++---
>  gcc/testsuite/gcc.target/nvptx/uniform-simt-1.c |  1 +
>  gcc/testsuite/gcc.target/nvptx/uniform-simt-2.c | 20 +++++++++++++++
>  3 files changed, 52 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/config/nvptx/nvptx.cc b/gcc/config/nvptx/nvptx.cc
> index afbad5bdde6..4942f1100da 100644
> --- a/gcc/config/nvptx/nvptx.cc
> +++ b/gcc/config/nvptx/nvptx.cc
> @@ -3248,12 +3248,18 @@ nvptx_call_insn_is_syscall_p (rtx_insn *insn)
>  /* If SET subexpression of INSN sets a register, emit a shuffle instruction to
>     propagate its value from lane MASTER to current lane.  */
>
> -static void
> +static bool
>  nvptx_unisimt_handle_set (rtx set, rtx_insn *insn, rtx master)
>  {
>    rtx reg;
>    if (GET_CODE (set) == SET && REG_P (reg = SET_DEST (set)))
> -    emit_insn_after (nvptx_gen_shuffle (reg, reg, master, SHUFFLE_IDX), insn);
> +    {
> +      emit_insn_after (nvptx_gen_shuffle (reg, reg, master, SHUFFLE_IDX),
> +                       insn);
> +      return true;
> +    }
> +
> +  return false;
>  }
>
>  /* Adjust code for uniform-simt code generation variant by making atomics and
> @@ -3275,8 +3281,30 @@ nvptx_reorg_uniform_simt ()
>          continue;
>        rtx pat = PATTERN (insn);
>        rtx master = nvptx_get_unisimt_master ();
> +      bool shuffle_p = false;
>        for (int i = 0; i < XVECLEN (pat, 0); i++)
> -        nvptx_unisimt_handle_set (XVECEXP (pat, 0, i), insn, master);
> +        shuffle_p
> +          |= nvptx_unisimt_handle_set (XVECEXP (pat, 0, i), insn, master);
> +      if (shuffle_p && TARGET_PTX_6_0)
> +        {
> +          /* The shuffle is a sync, so uniformity is guaranteed.  */
> +        }
> +      else
> +        {
> +          if (TARGET_PTX_6_0)
> +            {
> +              gcc_assert (!shuffle_p);
> +              /* Emit after the insn, to guarantee uniformity.
> +               */
> +              emit_insn_after (gen_nvptx_warpsync (), insn);
> +            }
> +          else
> +            {
> +              /* Emit after the insn (and before the shuffle, if there are any)
> +                 to check uniformity.  */
> +              emit_insn_after (gen_nvptx_uniform_warp_check (), insn);
> +            }
> +        }
> +
>        rtx pred = nvptx_get_unisimt_predicate ();
>        pred = gen_rtx_NE (BImode, pred, const0_rtx);
>        pat = gen_rtx_COND_EXEC (VOIDmode, pred, pat);
> diff --git a/gcc/testsuite/gcc.target/nvptx/uniform-simt-1.c b/gcc/testsuite/gcc.target/nvptx/uniform-simt-1.c
> index 1bc0adae014..77cffc40a66 100644
> --- a/gcc/testsuite/gcc.target/nvptx/uniform-simt-1.c
> +++ b/gcc/testsuite/gcc.target/nvptx/uniform-simt-1.c
> @@ -16,3 +16,4 @@ f (void)
>  }
>
>  /* { dg-final { scan-assembler-times "@%r\[0-9\]*\tatom.global.cas" 1 } } */
> +/* { dg-final { scan-assembler-times "shfl.sync.idx.b32" 1 } } */
> diff --git a/gcc/testsuite/gcc.target/nvptx/uniform-simt-2.c b/gcc/testsuite/gcc.target/nvptx/uniform-simt-2.c
> new file mode 100644
> index 00000000000..0f1e4e780fe
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/nvptx/uniform-simt-2.c
> @@ -0,0 +1,20 @@
> +/* { dg-options "-O2 -muniform-simt -mptx=3.1" } */
> +
> +enum memmodel
> +{
> +  MEMMODEL_RELAXED = 0,
> +};
> +
> +int a = 0;
> +
> +int
> +f (void)
> +{
> +  int expected = 1;
> +  return __atomic_compare_exchange_n (&a, &expected, 0, 0, MEMMODEL_RELAXED,
> +                                      MEMMODEL_RELAXED);
> +}
> +
> +/* { dg-final { scan-assembler-times "@%r\[0-9\]*\tatom.global.cas" 1 } } */
> +/* { dg-final { scan-assembler-times "shfl.idx.b32" 1 } } */
> +/* { dg-final { scan-assembler-times "vote.ballot.b32" 1 } } */
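
PS: To make concrete what I mean by "basically an 'assert'", here is a minimal host-side C sketch of what the emitted 'vote.ballot.b32 act,1' / 'setp.eq.b32 uni,act,0xffffffff' pair computes, as far as I understand it. This is only a model running on the CPU, not GPU code: the names 'WARP_SIZE', 'model_ballot_true' and 'model_warp_is_uniform' are made up for illustration, and the 'active' array is my assumption of how to represent which lanes actually execute the ballot. It says nothing about whether the hardware re-converges implicitly, which is exactly my question above.

```c
#include <assert.h>   /* Only needed if the model is exercised with assert.  */
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constant: number of lanes in one warp.  */
#define WARP_SIZE 32

/* Model of 'vote.ballot.b32 act,1': every lane that currently executes the
   instruction contributes a set bit (its predicate is the constant 1);
   lanes that are not active contribute a zero bit.  */
uint32_t
model_ballot_true (const bool active[WARP_SIZE])
{
  uint32_t act = 0;
  for (int lane = 0; lane < WARP_SIZE; lane++)
    if (active[lane])
      act |= UINT32_C (1) << lane;
  return act;
}

/* Model of 'setp.eq.b32 uni,act,0xffffffff': the warp counts as uniform
   only if all 32 lanes took part in the ballot.  The real sequence traps
   when this is false; the model just returns the result.  */
bool
model_warp_is_uniform (const bool active[WARP_SIZE])
{
  return model_ballot_true (active) == UINT32_C (0xffffffff);
}
```

With all 32 lanes marked active the model reports a uniform warp; clearing a single lane makes the ballot mask differ from 0xffffffff, which corresponds to the '@ ! uni trap' path in the PTX above.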