On 03/30/2018 07:45 AM, Tom de Vries wrote:
> On 03/30/2018 03:07 AM, Tom de Vries wrote:
>> On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
>>> As a follow-up patch will show, the nvptx BE falls back to using
>>> vector_length = 32 when a vector loop is nested inside a worker loop.
>>
>> I disabled the fallback and analyzed the vred2d-128.c illegal memory
>> access execution failure.
>>
>> I minimized it down to this ptx:
>> ...
>> .shared .align 8 .u8 __oacc_bcast[176];
>>
>> {
>>     {
>>         .reg .u32 %x;
>>         mov.u32 %x,%tid.x;
>>         setp.ne.u32 %r86,%x,0;
>>     }
>>
>>     {
>>         .reg .u32 %tidy;
>>         .reg .u64 %t_bcast;
>>         .reg .u64 %y64;
>>         mov.u32 %tidy,%tid.y;
>>         cvt.u64.u32 %y64,%tidy;
>>         add.u64 %y64,%y64,1;
>>         cvta.shared.u64 %t_bcast,__oacc_bcast;
>>         mad.lo.u64 %r66,%y64,88,%t_bcast;
>>     }
>>
>>     @ %r86 bra $L28;
>>     st.u32 [%r66+80],0;
>> $L28:
>>     ret;
>> }
>> ...
>>
>> The ptx is called with 2 workers and 128 vector_length.
>>
>> So, 2 workers means %tid.y has values 0 and 1.
>> Then %y64 has values 1 and 2.
>> Then %r66 has values __oacc_bcast + (1 * 88) and __oacc_bcast + (2 * 88).
>> Then the st.u32 accesses __oacc_bcast + (1 * 88) + 80 and
>> __oacc_bcast + (2 * 88) + 80.
>>
>> So we're accessing memory at location 256, while __oacc_bcast is
>> only 176 bytes big.
>>
>> I formulated this assert, which AFAIU detects this situation in the
>> compiler:
>> ...
>> @@ -1125,6 +1125,8 @@ nvptx_init_axis_predicate (FILE *file, int regno, const char *name)
>>    fprintf (file, "\t}\n");
>>  }
>>
>> +static int nvptx_mach_max_workers ();
>> +
>>  /* Emit code to initialize OpenACC worker broadcast and synchronization
>>     registers.  */
>>
>> @@ -1148,6 +1150,7 @@ nvptx_init_oacc_workers (FILE *file)
>>                 "// vector broadcast offset\n",
>>                 REGNO (cfun->machine->bcast_partition),
>>                 oacc_bcast_partition);
>> +      gcc_assert (oacc_bcast_partition * (nvptx_mach_max_workers () + 1) <= oacc_bcast_size);
>>      }
>>    if (cfun->machine->sync_bar)
>>      fprintf (file, "\t\tadd.u32\t\t%%r%d, %%tidy, 1; "
>> ...
>>
>> The assert is not triggered when the fallback is used.
>
> I've tracked the problem down to:
> ...
>> -          if (oacc_bcast_size < data.offset)
>> -            oacc_bcast_size = data.offset;
>> +          if (oacc_bcast_partition < data.offset)
>> +            {
>> +              int psize = data.offset;
>> +              psize = (psize + oacc_bcast_align - 1) & ~(oacc_bcast_align - 1);
>> +              oacc_bcast_partition = psize;
>> +              oacc_bcast_size = psize * (nvptx_mach_max_workers () + 1);
>> +            }
> ...
>
> We hit this if clause for the first compiled function, with num_workers(1).
>
> This sets oacc_bcast_partition and oacc_bcast_size as required for that
> function.
>
> Then we hit this if clause for a second compiled function, with
> num_workers(2).
>
> We need oacc_bcast_size updated, but 'oacc_bcast_partition < data.offset'
> is false, so the update doesn't happen.
>
> I managed to fix this by making the code unconditional, and using MAX
> to update oacc_bcast_partition and oacc_bcast_size.
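To make the fix Tom describes concrete, here is a standalone sketch (not the actual nvptx.c change; the helper name, the alignment value, and the concrete offsets/worker counts below are illustrative, chosen to match the 88-byte partition and 176-byte __oacc_bcast buffer from the analysis above):

...
/* Illustrative sketch only -- not GCC code.  Models the global
   broadcast-buffer bookkeeping from the quoted hunk: each compiled
   function reports its broadcast requirement (offset) and worker
   count, and the update must only ever grow partition/size.  */
#include <stdio.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

static int oacc_bcast_align = 8;     /* matches .align 8 in the ptx above */
static int oacc_bcast_partition = 0;
static int oacc_bcast_size = 0;

static void
update_bcast (int offset, int num_workers)
{
  /* Unconditional, MAX-based update: even when this function's offset
     is not larger than the current partition, a larger worker count
     can still require a larger total buffer.  */
  int psize = MAX (oacc_bcast_partition, offset);
  psize = (psize + oacc_bcast_align - 1) & ~(oacc_bcast_align - 1);
  oacc_bcast_partition = psize;
  oacc_bcast_size = MAX (oacc_bcast_size, psize * (num_workers + 1));
}

int
main (void)
{
  update_bcast (88, 1);  /* first function: num_workers(1) -> 88, 176 */
  update_bcast (88, 2);  /* second function: num_workers(2); the old
                            'oacc_bcast_partition < data.offset' test is
                            false here, so it would be skipped.  */
  printf ("partition=%d size=%d\n", oacc_bcast_partition, oacc_bcast_size);
  return 0;
}
...

With the original conditional form, the second function leaves oacc_bcast_size at 176 even though 2 workers need partition * (2 + 1) = 264 bytes, which is exactly the overflow the assert above catches; the MAX form grows the buffer to 264.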
It looks like that's fallout from this patch
<https://gcc.gnu.org/ml/gcc-patches/2018-03/msg01212.html>. I should have
checked that patch with the vector length fallback disabled.

Cesar