https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104364
Bug ID: 104364 Summary: [12 Regression] OpenMP/nvptx regressions after "[nvptx] Add some support for .local atomics" Product: gcc Version: 12.0 Status: UNCONFIRMED Keywords: openmp Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: tschwinge at gcc dot gnu.org CC: vries at gcc dot gnu.org Target Milestone: --- Target: nvptx As we've (slightly) discussed before, I acknowledge that commit r12-6966-ge0451f93d9faa13495132f4e246e9bef30b51417 "[nvptx] Add some support for .local atomics" is a good idea, thanks! (Though I wonder, wouldn't it be possible for the Driver/JIT to handle that internally? Aha, you had the same comment; first sentence in <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96494#c2>.) For certain GPU/Driver combinations, this change however causes regressions in OpenMP/nvptx testing: PASS: libgomp.c/../libgomp.c-c++-common/for-11.c (test for excess errors) [-PASS:-]{+WARNING: program timed out.+} {+FAIL:+} libgomp.c/../libgomp.c-c++-common/for-11.c execution test PASS: libgomp.c/../libgomp.c-c++-common/for-12.c (test for excess errors) [-PASS:-]{+WARNING: program timed out.+} {+FAIL:+} libgomp.c/../libgomp.c-c++-common/for-12.c execution test PASS: libgomp.c/../libgomp.c-c++-common/for-3.c (test for excess errors) [-PASS:-]{+WARNING: program timed out.+} {+FAIL:+} libgomp.c/../libgomp.c-c++-common/for-3.c execution test PASS: libgomp.c/../libgomp.c-c++-common/for-5.c (test for excess errors) [-PASS:-]{+FAIL:+} libgomp.c/../libgomp.c-c++-common/for-5.c execution test PASS: libgomp.c/../libgomp.c-c++-common/for-6.c (test for excess errors) [-PASS:-]{+WARNING: program timed out.+} {+FAIL:+} libgomp.c/../libgomp.c-c++-common/for-6.c execution test PASS: libgomp.c/../libgomp.c-c++-common/for-9.c (test for excess errors) [-PASS:-]{+WARNING: program timed out.+} {+FAIL:+} libgomp.c/../libgomp.c-c++-common/for-9.c execution test But varying, not always all of these. Sometimes with, sometimes without the timeout. Similar for C++. ... but only seen regressing for: - Nvidia Tesla K20c, Driver Version: 346.46 - Nvidia Tesla K20c, Driver Version: 455.38 - Nvidia Tesla K40c, Driver Version: 455.38 - Nvidia Tesla K80, Driver Version: 440.33.01 ... and never seen regressing for: - Nvidia Quadro P1000, Driver Version: 450.119.03 - Nvidia GeForce GTX 1080, Driver Version: 440.33.01 - Nvidia TITAN V, Driver Version: 455.23.05 (What is the underlying characteristic here?) I did notice that this change affects nvptx target libraries -- but OpenMP 'mgomp' multilib only -- as follows: 'nvptx-none/mgomp/libatomic/cas_1_.o' (complete diff): @@ -113,7 +113,7 @@ .loc 3 80 9 or.b64 %r61,%r60,%r39; .loc 3 82 11 -@ %r71 atom.cas.b64 %r62,[%r35],%r29,%r61; +atom.cas.b64 %r62,[%r35],%r29,%r61; mov.b64 {%r69,%r70},%r62; shfl.idx.b32 %r69,%r69,%r68,31; shfl.idx.b32 %r70,%r70,%r68,31; Similar for 'nvptx-none/mgomp/libatomic/cas_2_.o'. 'nvptx-none/mgomp/libatomic/cas_4_.o' (complete diff): @@ -50,9 +50,9 @@ mov.u32 %r25,%ar2; .loc 2 41 12 ld.u32 %r28,[%r24]; -@ %r37 membar.sys; -@ %r37 atom.cas.b32 %r29,[%r23],%r28,%r25; -@ %r37 membar.sys; +membar.sys; +atom.cas.b32 %r29,[%r23],%r28,%r25; +membar.sys; shfl.idx.b32 %r29,%r29,%r36,31; setp.eq.u32 %r31,%r29,%r28; selp.u32 %r30,1,0,%r31; Similar for 'nvptx-none/mgomp/libatomic/cas_8_.o'. Simlarly for several others in 'nvptx-none/mgomp/libatomic/', 'nvptx-none/mgomp/libbacktrace/dwarf.o', 'nvptx-none/mgomp/libgcc/atomic.o', 'nvptx-none/mgomp/libgfortran/single.o', several in 'nvptx-none/mgomp/libgomp/'. That is, remove predication from all the "atomic" instructions. I of course do see the code changes related to that in the commit, but I do wonder whether that's all intentional/correct? (Or, is the 'atom' provably a no-op if not '@ %r37'?) And, in 'nvptx-none/mgomp/libgomp/task.o' we've got several things like: -@ %r330 atom.cas.b32 %r208,[%frame+16],%r80,%r207; -@ %r330 membar.sys; +{ +.reg .pred %eq_p; +.reg .u32 %val; +ld.u32 %val,[%frame+16]; +setp.eq.u32 %eq_p,%val,%r80; +@ %eq_p st.u32 [%frame+16],%r207; +mov.u32 %r208,%val; +} So these are the non-'atom' replacement sequences that now get synthesized (haven't verified) -- but again this one lost the '@ %r330' predication from all the "atomic" instructions? And, is it correct here to use the non-'atom' replacement, though? '%frame' comes from: .visible .func GOMP_taskwait { .reg .u64 %stack; .reg .u64 %frame; .reg .u64 %sspslot; .reg .u64 %sspprev; { .reg .u32 %fstmp0; .reg .u64 %fstmp1; .reg .u64 %fstmp2; mov.u32 %fstmp0,%tid.y; mul.wide.u32 %fstmp1,%fstmp0,8; mov.u64 %fstmp2,__nvptx_stacks; add.u64 %sspslot,%fstmp2,%fstmp1; ld.shared.u64 %sspprev,[%sspslot]; sub.u64 %frame,%sspprev,32; sub.u64 %stack,%frame,16; st.shared.u64 [%sspslot],%stack; } [...] ... which -- at least from my superficial look -- seems to be some kind of 'shared' location, thus mandating 'atom' access? (But haven't looked in detail, so I may certainly be wrong, of course.) Are these changes and/or similar changes in nvptx code generation responsible for the regressing test cases? (Again, I have not looked.) I'll report in case anything here changes due to Tom's further recent GCC/nvptx back end commits.