https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104783

--- Comment #5 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Tom de Vries <vr...@gcc.gnu.org>:

https://gcc.gnu.org/g:f07178ca3c1e5dff799fb5016bb3767571db3165

commit r12-7586-gf07178ca3c1e5dff799fb5016bb3767571db3165
Author: Tom de Vries <tdevr...@suse.de>
Date:   Tue Mar 8 10:15:45 2022 +0100

    [nvptx] Disable warp sync in simt region

    I ran into a hang for this code:
    ...
      #pragma omp target map(tofrom: counter_N0)
      #pragma omp simd
      for (int i = 0 ; i < 1 ; i++ )
        {
          #pragma omp atomic update
          counter_N0 = counter_N0 + 1 ;
        }
    ...

    This has to do with the nature of -muniform-simt.  It has two modes of
    operation: inside and outside an SIMT region.

    Outside an SIMT region, a warp pretends to execute a single thread, but
    actually executes in all threads, to keep the local registers in all
threads
    consistent.  This approach works unless the insn that is executed is a
syscall
    or an atomic insn.  In that case, the insn is predicated, such that it
    executes in only one thread.  If the predicated insn writes a result to a
    register, then that register is propagated to the other threads, after
which
    the local registers in all threads are consistent again.

    Inside an SIMT region, a warp executes in all threads.  However, the
    predication and propagation for syscalls and atomic insns is also present
    here, because nvptx_reorg_uniform_simt works on all code.  Care has been
taken
    though to ensure that the predication and propagation is a nop.  That is,
    inside an SIMT region:
    - the predicate evalutes to true for each thread, and
    - the propagation insn copies a register from each thread to the same
thread.

    That works fine, until we use -mptx=6.0, and instead of using the
deprecated
    warp propagation insn shfl, we start using shfl.sync:
    ...
      @%r33 atom.add.u32            _, [%r29], 1;
            shfl.sync.idx.b32       %r30, %r30, %r32, 31, 0xffffffff;
    ...

    The shfl.sync specifies a member mask indicating all threads, but given
that
    the loop only has a single iteration, only thread 0 will execute the insn,
    where it will hang waiting for the other threads.

    Fix this by predicating the shfl.sync (and likewise, bar.warp.sync and the
    uniform warp check) such that it only executes outside the SIMT region.

    Tested on x86_64 with nvptx accelerator.

    gcc/ChangeLog:

    2022-03-08  Tom de Vries  <tdevr...@suse.de>

            PR target/104783
            * config/nvptx/nvptx.cc (nvptx_init_unisimt_predicate)
            (nvptx_output_unisimt_switch): Handle
unisimt_outside_simt_predicate.
            (nvptx_get_unisimt_outside_simt_predicate): New function.
            (predicate_insn): New function, factored out of ...
            (nvptx_reorg_uniform_simt): ... here.  Predicate all emitted insns.
            * config/nvptx/nvptx.h (struct machine_function): Add
            unisimt_outside_simt_predicate field.
            * config/nvptx/nvptx.md (define_insn "nvptx_warpsync")
            (define_insn "nvptx_uniform_warp_check"): Make predicable.

    libgomp/ChangeLog:

    2022-03-10  Tom de Vries  <tdevr...@suse.de>

            * testsuite/libgomp.c/pr104783.c: New test.

Reply via email to