Hi Tom!

On 2022-04-01T13:23:06+0200, Tom de Vries <tdevr...@suse.de> wrote:
> When running test-case libgomp.oacc-c-c++-common/vector-length-128-7.c on an
> RTX A2000 (sm_86) with driver 510.60.02 I run into:
> ...
> FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/vector-length-128-7.c \
> -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none -O0 \
> output pattern test
> ...
>
> The failing check verifies the launch dimensions:
> ...
> /* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: \
> launch gangs=1, workers=8, vectors=128" } */
> ...
> which fails because (as we can see with GOMP_DEBUG=1) the actual num_workers
> is 6:
> ...
> nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=6, vectors=128
> ...
>
> This is due to the result of cuOccupancyMaxPotentialBlockSize (which suggests
> 'a launch configuration with reasonable occupancy') printed just before:
> ...
> cuOccupancyMaxPotentialBlockSize: grid = 52, block = 768
> ...
> [ Note: 6 * 128 == 768. ]

A while ago I had observed, and have now finally looked into, a similar case
with Nvidia TITAN V, Driver 455.23.05, GCC/nvptx default multilib.

Looking at 'GOMP_DEBUG=1' output:

'-O2'; all good:

    [...]
    Link log info : 0 bytes gmem
    info : Function properties for 'main$_omp_fn$0':
    info : used 32 registers, 0 stack, 288 bytes smem, 360 bytes cmem[0], 0 bytes lmem
    GOMP_OFFLOAD_openacc_exec: prepare mappings
    warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
    default dimensions [160,32,32]
    cuOccupancyMaxPotentialBlockSize: grid = 160, block = 1024
    nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=8, vectors=128
    nvptx_exec: kernel main$_omp_fn$0: finished

... vs. '-O0'; similar to your report:

    [...]
    Link log info : 0 bytes gmem
    info : Function properties for 'main$_omp_fn$0':
    info : used 33 registers, 32 stack, 432 bytes smem, 360 bytes cmem[0], 0 bytes lmem
    GOMP_OFFLOAD_openacc_exec: prepare mappings
    warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
    default dimensions [160,32,32]
    cuOccupancyMaxPotentialBlockSize: grid = 160, block = 768
    nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=6, vectors=128
    nvptx_exec: kernel main$_omp_fn$0: finished

..., so I would've suggested:

> Fix this by updating the check to allow num_workers in the range 1 to 8.

... to do this for '-O0' only, to make sure that we'll notice should the
'-O2' case regress at some later point in time.  Are you OK if I make the
obvious change?

But that said...  We might also generally classify this as a regression,
because when using the GCC/nvptx '-mptx=3.1' multilib instead of the default
multilib ('-foffload-options=nvptx-none=-mptx=3.1'), I see:

'-O2'; all good (exactly the same launch configuration as with GCC/nvptx
default multilib, see above):

    [...]
    Link log info : 0 bytes gmem
    info : Function properties for 'main$_omp_fn$0':
    info : used 32 registers, 0 stack, 288 bytes smem, 360 bytes cmem[0], 0 bytes lmem
    GOMP_OFFLOAD_openacc_exec: prepare mappings
    warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
    default dimensions [160,32,32]
    cuOccupancyMaxPotentialBlockSize: grid = 160, block = 1024
    nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=8, vectors=128
    nvptx_exec: kernel main$_omp_fn$0: finished

..., but also for '-O0'; all good:

    Link log info : 0 bytes gmem
    info : Function properties for 'main$_omp_fn$0':
    info : used 30 registers, 32 stack, 432 bytes smem, 360 bytes cmem[0], 0 bytes lmem
    GOMP_OFFLOAD_openacc_exec: prepare mappings
    warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
    default dimensions [160,32,32]
    cuOccupancyMaxPotentialBlockSize: grid = 160, block = 1024
    nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=8, vectors=128
    nvptx_exec: kernel main$_omp_fn$0: finished

Are you able to reproduce that?

Here follows the '-O0' word-diff between the GCC/nvptx default and the
'-mptx=3.1' multilib:

    Link log info : 0 bytes gmem
    info : Function properties for 'main$_omp_fn$0':
    info : used [-33-]{+30+} registers, 32 stack, 432 bytes smem, 360 bytes cmem[0], 0 bytes lmem
    GOMP_OFFLOAD_openacc_exec: prepare mappings
    warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
    default dimensions [160,32,32]
    cuOccupancyMaxPotentialBlockSize: grid = 160, block = [-768-]{+1024+}
    nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, [-workers=6,-]{+workers=8,+} vectors=128
    nvptx_exec: kernel main$_omp_fn$0: finished

Notice that the GCC/nvptx default multilib uses 33 registers, whereas the
'-mptx=3.1' multilib uses 30 registers!  (..., which then allows for
'block = [-768-]{+1024+}', 'workers=[-6-]{+8+}'.)

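One possible way to see why 33 vs. 30 registers could make this difference is a simplified occupancy model. The C sketch below is only an illustration: it is not the CUDA driver's algorithm and not code from libgomp. The per-SM limits (65536 registers, 64 resident warps, registers allocated per warp in units of 256, warps allocated in groups of 4) are assumed values for an sm_70 device such as the TITAN V, shared memory and the per-SM block limit are ignored, and all names are made up for this example.

```c
/* Simplified occupancy model.  All limits are ASSUMPTIONS for an sm_70
   device; none of them is taken from the logs in this thread.  */
enum {
  WARP_SIZE = 32,
  MAX_BLOCK_SIZE = 1024,
  MAX_WARPS_PER_SM = 64,        /* assumed */
  REGS_PER_SM = 65536,          /* assumed */
  REG_ALLOC_UNIT = 256,         /* assumed: per-warp register granularity */
  WARP_ALLOC_GRANULARITY = 4    /* assumed */
};

static int round_up (int v, int m) { return (v + m - 1) / m * m; }
static int round_down (int v, int m) { return v / m * m; }

/* Threads resident on one SM for a given register count and block size.  */
static int
resident_threads (int regs_per_thread, int block_size)
{
  int warps_per_block = (block_size + WARP_SIZE - 1) / WARP_SIZE;
  int regs_per_warp = round_up (regs_per_thread * WARP_SIZE, REG_ALLOC_UNIT);
  int warps_by_regs = round_down (REGS_PER_SM / regs_per_warp,
                                  WARP_ALLOC_GRANULARITY);
  int blocks_by_regs
    = warps_by_regs / round_up (warps_per_block, WARP_ALLOC_GRANULARITY);
  int blocks_by_warps = MAX_WARPS_PER_SM / warps_per_block;
  int blocks = blocks_by_regs < blocks_by_warps ? blocks_by_regs
                                                : blocks_by_warps;
  return blocks * block_size;
}

/* Largest block size that maximizes resident threads per SM.  */
static int
suggest_block_size (int regs_per_thread)
{
  int best = 0, best_threads = 0;
  for (int b = MAX_BLOCK_SIZE; b >= WARP_SIZE; b -= WARP_SIZE)
    {
      int t = resident_threads (regs_per_thread, b);
      if (t > best_threads)     /* strict '>' keeps the larger block on ties */
        {
          best_threads = t;
          best = b;
        }
    }
  return best;
}

/* Workers per gang for a given block size, as in "6 * 128 == 768".  */
static int
workers_for_block (int block_size, int vector_length)
{
  return vector_length > 0 ? block_size / vector_length : 0;
}
```

Under these assumptions, suggest_block_size (33) gives 768 and suggest_block_size (30) gives 1024, that is, 6 and 8 workers at a vector length of 128. In this model the jump happens because 33 registers per thread round up to 1280 registers per warp, so two 1024-thread blocks no longer fit on one SM.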
If that's useful, 'diff' of the PTX code that gets loaded to the GPU:

     // BEGIN PREAMBLE
    -.version 6.0
    +.version 3.1
     .target sm_30
     .address_size 64
     // END PREAMBLE
    @@ -158,9 +158,17 @@ setp.ne.u32 %r111,%r110,0;
     add.u64 %r109,%r109,8;
     @ %r111 bra.uni $L11;
     $L19:
    -bar.warp.sync 0xffffffff;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
     $L18:
    -barrier.sync.aligned 0;
    +bar.sync 0;
     // forked 2;
     @ %r113 bra $L12;
     cvta.shared.u64 %r101,__oacc_bcast;
    @@ -179,7 +187,15 @@ mov.u32 %r22,0;
     mov.u32 %r29,1;
     mov.u32 %r30,%ntid.y;
     $L12:
    -bar.warp.sync 0xffffffff;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
     $L7:
     @ %r113 bra $L13;
     mov.u32 %r23,%tid.y;
    @@ -188,11 +204,19 @@ setp.ge.s32 %r62,%r23,%r31;
     selp.u32 %r114,1,0,%r62;
     st.u32 [%r93],%r114;
     $L13:
    -bar.warp.sync 0xffffffff;
    -barrier.sync %r94,128;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
    +bar.sync %r94,128;
     ld.u32 %r115,[%r93];
     setp.ne.u32 %r62,%r115,0;
    -barrier.sync %r94,128;
    +bar.sync %r94,128;
     @ %r62 bra.uni $L2;
     $L6:
     @ %r113 bra $L14;
    @@ -220,8 +244,16 @@ st.u32 [%r95+36],%r30;
     st.u32 [%r95+40],%r31;
     st.u32 [%r95+44],%r34;
     $L14:
    -bar.warp.sync 0xffffffff;
    -barrier.sync %r94,128;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
    +bar.sync %r94,128;
     // forked 4;
     mov.u64 %r87,%r93;
     mov.u64 %r89,%frame;
    @@ -296,7 +328,7 @@ setp.lt.s32 %r83,%r24,%r38;
     mov.u32 %r56,%r37;
     st.u32 [%frame+8],%r56;
     // joining 4;
    -barrier.sync %r94,128;
    +bar.sync %r94,128;
     // join 4;
     @ %r113 bra $L15;
     add.u32 %r23,%r23,%r30;
    @@ -304,11 +336,19 @@ setp.lt.s32 %r84,%r23,%r31;
     selp.u32 %r116,1,0,%r84;
     st.u32 [%r93],%r116;
     $L15:
    -bar.warp.sync 0xffffffff;
    -barrier.sync %r94,128;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
    +bar.sync %r94,128;
     ld.u32 %r117,[%r93];
     setp.ne.u32 %r84,%r117,0;
    -barrier.sync %r94,128;
    +bar.sync %r94,128;
     @ %r84 bra.uni $L6;
     $L2:
     @ %r113 bra $L16;
    @@ -317,19 +357,35 @@ setp.lt.s32 %r85,%r22,%r29;
     selp.u32 %r118,1,0,%r85;
     st.u32 [%r93],%r118;
     $L16:
    -bar.warp.sync 0xffffffff;
    -barrier.sync %r94,128;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
    +bar.sync %r94,128;
     ld.u32 %r119,[%r93];
     setp.ne.u32 %r85,%r119,0;
    -barrier.sync %r94,128;
    +bar.sync %r94,128;
     @ %r85 bra.uni $L7;
     @ %r113 bra $L17;
     mov.u32 %r86,4;
     st.u32 [%frame+4],%r86;
     // joining 2;
     $L17:
    -bar.warp.sync 0xffffffff;
    -barrier.sync.aligned 0;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
    +bar.sync 0;
     // join 2;
     ret;
     }

Do the 'trap'/'exit' "no-return" calls allow for optimizing JIT register
allocation?  Does it follow that we should be doing something different in
the GCC/nvptx default multilib, to achieve a similar outcome (without
otherwise pessimizing the code, of course)?


Regards,
 Thomas


> [libgomp, testsuite, nvptx] Fix dg-output test in vector-length-128-7.c
>
> libgomp/ChangeLog:
>
> 2022-04-01  Tom de Vries  <tdevr...@suse.de>
>
> 	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c: Fix
> 	num_workers check.
>
> ---
>  libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
> index 4a8c1bf549e..92b3de03636 100644
> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
> @@ -37,4 +37,4 @@ main (void)
>  }
>
>  /* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 0, 128\\)" "oaccloops" } } */
> -/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=8, vectors=128" } */
> +/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=\[1-8\], vectors=128" } */
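
To illustrate what the relaxed 'workers=\[1-8\]' pattern from the patch accepts, here is a minimal C sketch using POSIX regular expressions. It is not part of the testsuite: dg-output patterns are Tcl regular expressions, so the POSIX pattern below (without the Tcl escaping) is only an assumed equivalent, and 'launch_line_ok' is a hypothetical helper name.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if LINE contains a launch message with 1 to 8 workers.
   Hypothetical helper; the pattern is an assumed POSIX ERE equivalent
   of the dg-output pattern in the patch.  */
static bool
launch_line_ok (const char *line)
{
  regex_t re;
  if (regcomp (&re, "launch gangs=1, workers=[1-8], vectors=128",
               REG_EXTENDED | REG_NOSUB) != 0)
    return false;
  bool ok = regexec (&re, line, 0, NULL, 0) == 0;
  regfree (&re);
  return ok;
}
```

The pattern accepts both the 'workers=6' and the 'workers=8' launch lines shown earlier, and any other single-digit value from 1 to 8, which is why restricting the relaxed pattern to '-O0' keeps the stricter 'workers=8' expectation for the other cases.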