Re: [committed][nvptx] Use nvptx_warpsync / nvptx_uniform_warp_check for -muniform-simt

Thomas Schwinge Wed, 23 Feb 2022 01:06:23 -0800

Hi Tom!

This is me again, following along GCC/nvptx development, and asking
questions.  ;-)


On 2022-02-19T20:07:18+0100, Tom de Vries via Gcc-patches 
<gcc-patches@gcc.gnu.org> wrote:
> With the default ptx isa 6.0, we have for uniform-simt-1.c:
> ...
>         @%r33   atom.global.cas.b32     %r26, [a], %r28, %r29;
>                 shfl.sync.idx.b32       %r26, %r26, %r32, 31, 0xffffffff;
> ...
>
> The atomic insn is predicated by -muniform-simt, and the subsequent insn does
> a warp sync, at which point the warp is uniform again.

I understand the concern here is Independent Thread Scheduling, where the
execution of predicated-off threads of a warp ('@ ! %r33') may proceed
with the next instruction, 'shfl', without implicitly waiting for the
other threads of a warp still working on the 'atom'?  Hence, the 'sync'
aspect of 'shfl.sync', as a means that PTX provides at the ISA level such
that we're getting the desired semantics: as its first step, "wait for
all threads in membermask to arrive".
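
To make sure I have the data flow right, here is a small host-side C
model of what the two instructions do to the register values of one
warp.  It is purely illustrative, not GCC or PTX code: the names
'WARP_SIZE', 'predicated_cas' and 'shfl_idx' are made up for this sketch,
and it models only the values, not the scheduling or the 'sync' aspect.
Only the lane whose predicate is set performs the compare-and-swap, and
the shuffle then copies that lane's result register to every lane:

```c
#include <assert.h>
#include <stdint.h>

#define WARP_SIZE 32

/* Hypothetical model of '@pred atom.cas': only lanes with a non-zero
   predicate execute the compare-and-swap; all other lanes leave their
   result register untouched.  */
static void
predicated_cas (const int pred[WARP_SIZE], uint32_t result[WARP_SIZE],
                uint32_t *mem, uint32_t expected, uint32_t desired)
{
  for (int lane = 0; lane < WARP_SIZE; lane++)
    if (pred[lane])
      {
        result[lane] = *mem;    /* The CAS returns the old value.  */
        if (*mem == expected)
          *mem = desired;
      }
}

/* Hypothetical model of 'shfl.idx': every lane reads the register of
   lane MASTER, so all lanes end up with the same value.  */
static void
shfl_idx (uint32_t reg[WARP_SIZE], int master)
{
  assert (master >= 0 && master < WARP_SIZE);
  uint32_t value = reg[master];
  for (int lane = 0; lane < WARP_SIZE; lane++)
    reg[lane] = value;
}
```

In this model the result is uniform after the shuffle no matter which
lane executed the 'atom'; what the model cannot express is whether all
lanes have actually arrived at the shuffle, which is the point in
question here.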

> But with -mptx=3.1, we have instead:
> ...
>         @%r33   atom.global.cas.b32     %r26, [a], %r28, %r29;
>                 shfl.idx.b32    %r26, %r26, %r32, 31;
> ...
>
> The shfl does not sync the warp, and we want the warp to go back to executing
> uniformly asap.  We cannot enforce this

Is it really the case that such code may cause "permanent" warp-divergent
execution (until re-converging "somewhere")?  My understanding has been
that predicated-off threads of a warp ('@ ! %r33') would simply idle,
implicitly waiting for the other threads of a warp still working on the
'atom' -- due to the nature of a shared program counter per warp, and the
desire to re-converge as soon as possible.

For example, PTX ISA 7.2, 3.1. "A Set of SIMT Multiprocessors":

| [...]
| At every instruction issue time, the SIMT unit selects a warp that is
| ready to execute and issues the next instruction to the active threads
| of the warp. A warp executes one common instruction at a time, so full
| efficiency is realized when all threads of a warp agree on their
| execution path. If threads of a warp diverge via a data-dependent
| conditional branch, the warp serially executes each branch path taken,
| disabling threads that are not on that path, and when all paths
| complete, the threads converge back to the same execution path. [...]

So I'd have assumed that after the potentially-diverging
'@%r33'-predicated 'atom' instruction, we're implicitly re-converging for
the unpredicated 'shfl' (as long as Independent Thread Scheduling isn't
involved, which it isn't for '-mptx=3.1')?

As I'm understanding you, my understanding is not correct, and we may
thus be getting "permanent" warp-divergent execution as soon as there's
any predication/conditional involved that may evaluate differently for
individual threads of a warp, and we thus need such *explicit*
synchronization after all such instances?

> but at least check this using
> nvptx_uniform_warp_check, similar to how that is done for openacc.
>
> Likewise, detect the case that no shfl insn is emitted, and add a
> nvptx_uniform_warp_check or nvptx_warpsync.

For example, 'nvptx-none/mgomp/libatomic/cas_1_.o':

    [...]
     @ %r71 atom.cas.b64 %r62,[%r35],%r29,%r61;
    +{
    +.reg .b32 act;
    +vote.ballot.b32 act,1;
    +.reg .pred uni;
    +setp.eq.b32 uni,act,0xffffffff;
    +@ ! uni trap;
    +@ ! uni exit;
    +}
     mov.b64 {%r69,%r70},%r62;
     shfl.idx.b32 %r69,%r69,%r68,31;
     shfl.idx.b32 %r70,%r70,%r68,31;
    [...]

So that's basically an 'assert' that all threads of a warp are converged.
(Is the JIT maybe even able to optimize that out?)  I guess I just wonder
if that's not satisfied implicitly.
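
Spelled out as a host-side C sketch, this is how I read the emitted
sequence.  The function names are hypothetical, and I'm assuming that
lanes which are not active at the 'vote.ballot' contribute a 0 bit to
its result:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of 'vote.ballot.b32 act,1': bit N of the result is
   set iff lane N is active and its predicate (here the constant 1) is
   true.  ACTIVE_MASK has one bit per lane of the warp.  */
static uint32_t
vote_ballot (uint32_t active_mask)
{
  return active_mask;
}

/* Model of 'setp.eq.b32 uni,act,0xffffffff': the warp counts as uniform
   only if all 32 lanes took part in the ballot.  */
static bool
uniform_warp_p (uint32_t active_mask)
{
  return vote_ballot (active_mask) == UINT32_C (0xffffffff);
}

/* Model of '@ ! uni trap': abort if the warp is not uniform.  */
static void
uniform_warp_check (uint32_t active_mask)
{
  assert (uniform_warp_p (active_mask));
}
```

So the check only detects divergence at that point (and then traps); it
does nothing to bring the lanes back together.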


Grüße
 Thomas


> [nvptx] Use nvptx_warpsync / nvptx_uniform_warp_check for -muniform-simt
>
> gcc/ChangeLog:
>
> 2022-02-19  Tom de Vries  <tdevr...@suse.de>
>
>       * config/nvptx/nvptx.cc (nvptx_unisimt_handle_set): Change return
>       type to bool.
>       (nvptx_reorg_uniform_simt): Insert nvptx_uniform_warp_check or
>       nvptx_warpsync, if necessary.
>
> gcc/testsuite/ChangeLog:
>
> 2022-02-19  Tom de Vries  <tdevr...@suse.de>
>
>       * gcc.target/nvptx/uniform-simt-1.c: Add scan-assembler test.
>       * gcc.target/nvptx/uniform-simt-2.c: New test.
>
> ---
>  gcc/config/nvptx/nvptx.cc                       | 34 ++++++++++++++++++++++---
>  gcc/testsuite/gcc.target/nvptx/uniform-simt-1.c |  1 +
>  gcc/testsuite/gcc.target/nvptx/uniform-simt-2.c | 20 +++++++++++++++
>  3 files changed, 52 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/config/nvptx/nvptx.cc b/gcc/config/nvptx/nvptx.cc
> index afbad5bdde6..4942f1100da 100644
> --- a/gcc/config/nvptx/nvptx.cc
> +++ b/gcc/config/nvptx/nvptx.cc
> @@ -3248,12 +3248,18 @@ nvptx_call_insn_is_syscall_p (rtx_insn *insn)
>  /* If SET subexpression of INSN sets a register, emit a shuffle instruction to
>     propagate its value from lane MASTER to current lane.  */
>
> -static void
> +static bool
>  nvptx_unisimt_handle_set (rtx set, rtx_insn *insn, rtx master)
>  {
>    rtx reg;
>    if (GET_CODE (set) == SET && REG_P (reg = SET_DEST (set)))
> -    emit_insn_after (nvptx_gen_shuffle (reg, reg, master, SHUFFLE_IDX), insn);
> +    {
> +      emit_insn_after (nvptx_gen_shuffle (reg, reg, master, SHUFFLE_IDX),
> +                    insn);
> +      return true;
> +    }
> +
> +  return false;
>  }
>
>  /* Adjust code for uniform-simt code generation variant by making atomics and
> @@ -3275,8 +3281,30 @@ nvptx_reorg_uniform_simt ()
>       continue;
>        rtx pat = PATTERN (insn);
>        rtx master = nvptx_get_unisimt_master ();
> +      bool shuffle_p = false;
>        for (int i = 0; i < XVECLEN (pat, 0); i++)
> -     nvptx_unisimt_handle_set (XVECEXP (pat, 0, i), insn, master);
> +     shuffle_p
> +       |= nvptx_unisimt_handle_set (XVECEXP (pat, 0, i), insn, master);
> +      if (shuffle_p && TARGET_PTX_6_0)
> +     {
> +       /* The shuffle is a sync, so uniformity is guaranteed.  */
> +     }
> +      else
> +     {
> +       if (TARGET_PTX_6_0)
> +         {
> +           gcc_assert (!shuffle_p);
> +           /* Emit after the insn, to guarantee uniformity.  */
> +           emit_insn_after (gen_nvptx_warpsync (), insn);
> +         }
> +       else
> +         {
> +           /* Emit after the insn (and before the shuffle, if there are any)
> +              to check uniformity.  */
> +           emit_insn_after (gen_nvptx_uniform_warp_check (), insn);
> +         }
> +     }
> +
>        rtx pred = nvptx_get_unisimt_predicate ();
>        pred = gen_rtx_NE (BImode, pred, const0_rtx);
>        pat = gen_rtx_COND_EXEC (VOIDmode, pred, pat);
> diff --git a/gcc/testsuite/gcc.target/nvptx/uniform-simt-1.c b/gcc/testsuite/gcc.target/nvptx/uniform-simt-1.c
> index 1bc0adae014..77cffc40a66 100644
> --- a/gcc/testsuite/gcc.target/nvptx/uniform-simt-1.c
> +++ b/gcc/testsuite/gcc.target/nvptx/uniform-simt-1.c
> @@ -16,3 +16,4 @@ f (void)
>  }
>
>  /* { dg-final { scan-assembler-times "@%r\[0-9\]*\tatom.global.cas" 1 } } */
> +/* { dg-final { scan-assembler-times "shfl.sync.idx.b32" 1 } } */
> diff --git a/gcc/testsuite/gcc.target/nvptx/uniform-simt-2.c b/gcc/testsuite/gcc.target/nvptx/uniform-simt-2.c
> new file mode 100644
> index 00000000000..0f1e4e780fe
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/nvptx/uniform-simt-2.c
> @@ -0,0 +1,20 @@
> +/* { dg-options "-O2 -muniform-simt -mptx=3.1" } */
> +
> +enum memmodel
> +{
> +  MEMMODEL_RELAXED = 0,
> +};
> +
> +int a = 0;
> +
> +int
> +f (void)
> +{
> +  int expected = 1;
> +  return __atomic_compare_exchange_n (&a, &expected, 0, 0, MEMMODEL_RELAXED,
> +                                   MEMMODEL_RELAXED);
> +}
> +
> +/* { dg-final { scan-assembler-times "@%r\[0-9\]*\tatom.global.cas" 1 } } */
> +/* { dg-final { scan-assembler-times "shfl.idx.b32" 1 } } */
> +/* { dg-final { scan-assembler-times "vote.ballot.b32" 1 } } */
