nvptx regressions after "[nvptx] Add some support for .local atomics"

tschwinge at gcc dot gnu.org via Gcc-bugs Thu, 03 Feb 2022 01:53:49 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104364


            Bug ID: 104364
           Summary: [12 Regression] OpenMP/nvptx regressions after
                    "[nvptx] Add some support for .local atomics"
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Keywords: openmp
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tschwinge at gcc dot gnu.org
                CC: vries at gcc dot gnu.org
  Target Milestone: ---
            Target: nvptx

As we've (slightly) discussed before, I acknowledge that commit
r12-6966-ge0451f93d9faa13495132f4e246e9bef30b51417 "[nvptx] Add some support
for .local atomics" is a good idea, thanks!  (Though I wonder, wouldn't it be
possible for the Driver/JIT to handle that internally?  Aha, you had the same
comment; first sentence in
<https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96494#c2>.)

For certain GPU/Driver combinations, this change however causes regressions in
OpenMP/nvptx testing:

    PASS: libgomp.c/../libgomp.c-c++-common/for-11.c (test for excess errors)
    [-PASS:-]{+WARNING: program timed out.+}
    {+FAIL:+} libgomp.c/../libgomp.c-c++-common/for-11.c execution test

    PASS: libgomp.c/../libgomp.c-c++-common/for-12.c (test for excess errors)
    [-PASS:-]{+WARNING: program timed out.+}
    {+FAIL:+} libgomp.c/../libgomp.c-c++-common/for-12.c execution test

    PASS: libgomp.c/../libgomp.c-c++-common/for-3.c (test for excess errors)
    [-PASS:-]{+WARNING: program timed out.+}
    {+FAIL:+} libgomp.c/../libgomp.c-c++-common/for-3.c execution test

    PASS: libgomp.c/../libgomp.c-c++-common/for-5.c (test for excess errors)
    [-PASS:-]{+FAIL:+} libgomp.c/../libgomp.c-c++-common/for-5.c execution test

    PASS: libgomp.c/../libgomp.c-c++-common/for-6.c (test for excess errors)
    [-PASS:-]{+WARNING: program timed out.+}
    {+FAIL:+} libgomp.c/../libgomp.c-c++-common/for-6.c execution test

    PASS: libgomp.c/../libgomp.c-c++-common/for-9.c (test for excess errors)
    [-PASS:-]{+WARNING: program timed out.+}
    {+FAIL:+} libgomp.c/../libgomp.c-c++-common/for-9.c execution test

But varying, not always all of these.

Sometimes with, sometimes without the timeout.

Similar for C++.

... but only seen regressing for:

  - Nvidia Tesla K20c, Driver Version: 346.46
  - Nvidia Tesla K20c, Driver Version: 455.38
  - Nvidia Tesla K40c, Driver Version: 455.38
  - Nvidia Tesla K80, Driver Version: 440.33.01

... and never seen regressing for:

  - Nvidia Quadro P1000, Driver Version: 450.119.03
  - Nvidia GeForce GTX 1080, Driver Version: 440.33.01
  - Nvidia TITAN V, Driver Version: 455.23.05

(What is the underlying characteristic here?)


I did notice that this change affects nvptx target libraries -- but OpenMP
'mgomp' multilib only -- as follows:

'nvptx-none/mgomp/libatomic/cas_1_.o' (complete diff):

    @@ -113,7 +113,7 @@
     .loc 3 80 9
     or.b64 %r61,%r60,%r39;
     .loc 3 82 11
    -@ %r71 atom.cas.b64 %r62,[%r35],%r29,%r61;
    +atom.cas.b64 %r62,[%r35],%r29,%r61;
     mov.b64 {%r69,%r70},%r62;
     shfl.idx.b32 %r69,%r69,%r68,31;
     shfl.idx.b32 %r70,%r70,%r68,31;

Similar for 'nvptx-none/mgomp/libatomic/cas_2_.o'.

'nvptx-none/mgomp/libatomic/cas_4_.o' (complete diff):

    @@ -50,9 +50,9 @@
     mov.u32 %r25,%ar2;
     .loc 2 41 12
     ld.u32 %r28,[%r24];
    -@ %r37 membar.sys;
    -@ %r37 atom.cas.b32 %r29,[%r23],%r28,%r25;
    -@ %r37 membar.sys;
    +membar.sys;
    +atom.cas.b32 %r29,[%r23],%r28,%r25;
    +membar.sys;
     shfl.idx.b32 %r29,%r29,%r36,31;
     setp.eq.u32 %r31,%r29,%r28;
     selp.u32 %r30,1,0,%r31;

Similar for 'nvptx-none/mgomp/libatomic/cas_8_.o'.

Simlarly for several others in 'nvptx-none/mgomp/libatomic/',
'nvptx-none/mgomp/libbacktrace/dwarf.o', 'nvptx-none/mgomp/libgcc/atomic.o',
'nvptx-none/mgomp/libgfortran/single.o', several in
'nvptx-none/mgomp/libgomp/'.

That is, remove predication from all the "atomic" instructions.  I of course do
see the code changes related to that in the commit, but I do wonder whether
that's all intentional/correct?  (Or, is the 'atom' provably a no-op if not '@
%r37'?)

And, in 'nvptx-none/mgomp/libgomp/task.o' we've got several things like:

    -@ %r330 atom.cas.b32 %r208,[%frame+16],%r80,%r207;
    -@ %r330 membar.sys;
    +{
    +.reg .pred %eq_p;
    +.reg .u32 %val;
    +ld.u32 %val,[%frame+16];
    +setp.eq.u32 %eq_p,%val,%r80;
    +@ %eq_p st.u32 [%frame+16],%r207;
    +mov.u32 %r208,%val;
    +}

So these are the non-'atom' replacement sequences that now get synthesized
(haven't verified) -- but again this one lost the '@ %r330' predication from
all the "atomic" instructions?

And, is it correct here to use the non-'atom' replacement, though?  '%frame'
comes from:

    .visible .func GOMP_taskwait
    {
    .reg .u64 %stack;
    .reg .u64 %frame;
    .reg .u64 %sspslot;
    .reg .u64 %sspprev;
    {
    .reg .u32 %fstmp0;
    .reg .u64 %fstmp1;
    .reg .u64 %fstmp2;
    mov.u32 %fstmp0,%tid.y;
    mul.wide.u32 %fstmp1,%fstmp0,8;
    mov.u64 %fstmp2,__nvptx_stacks;
    add.u64 %sspslot,%fstmp2,%fstmp1;
    ld.shared.u64 %sspprev,[%sspslot];
    sub.u64 %frame,%sspprev,32;
    sub.u64 %stack,%frame,16;
    st.shared.u64 [%sspslot],%stack;
    }
    [...]

... which -- at least from my superficial look -- seems to be some kind of
'shared' location, thus mandating 'atom' access?  (But haven't looked in
detail, so I may certainly be wrong, of course.)

Are these changes and/or similar changes in nvptx code generation responsible
for the regressing test cases?  (Again, I have not looked.)


I'll report in case anything here changes due to Tom's further recent GCC/nvptx
back end commits.

[Bug target/104364] New: [12 Regression] OpenMP/nvptx regressions after "[nvptx] Add some support for .local atomics"

Reply via email to