On 03/30/2018 03:07 AM, Tom de Vries wrote:
On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
As a follow up patch will show, the nvptx BE falls back to using
vector_length = 32 when a vector loop is nested inside a worker loop.
I disabled the fallback, and analyzed the vred2d-128.c illegal memory
access execution failure.
I minimized that down to this ptx:
...
.shared .align 8 .u8 __oacc_bcast[176];
{
{
.reg .u32 %x;
mov.u32 %x,%tid.x;
setp.ne.u32 %r86,%x,0;
}
{
.reg .u32 %tidy;
.reg .u64 %t_bcast;
.reg .u64 %y64;
mov.u32 %tidy,%tid.y;
cvt.u64.u32 %y64,%tidy;
add.u64 %y64,%y64,1;
cvta.shared.u64 %t_bcast,__oacc_bcast;
mad.lo.u64 %r66,%y64,88,%t_bcast;
}
@ %r86 bra $L28;
st.u32 [%r66+80],0;
$L28:
ret;
}
...
The ptx is called with 2 workers and 128 vector_length.
So, 2 workers mean %tid.y has values 0 and 1.
Then %y64 has values 1 and 2.
Then %r66 has values __oacc_bcast + (1 * 88) and __oacc_bcast + (2 * 88).
Then the st.u32 accesss __oacc_bcast + (1 * 88) + 80 and __oacc_bcast +
(2 * 88) + 80.
So we're accessing memory at location 256, while the __oacc_bcast is
only 176 bytes big.
I formulated this assert that AFAIU detects this situation in the compiler:
...
@@ -1125,6 +1125,8 @@ nvptx_init_axis_predicate (FILE *file, int regno,
const char *name)
fprintf (file, "\t}\n");
}
+static int nvptx_mach_max_workers ();
+
/* Emit code to initialize OpenACC worker broadcast and synchronization
registers. */
@@ -1148,6 +1150,7 @@ nvptx_init_oacc_workers (FILE *file)
"// vector broadcast offset\n",
REGNO (cfun->machine->bcast_partition),
oacc_bcast_partition);
+ gcc_assert (oacc_bcast_partition * (nvptx_mach_max_workers () +
1) <= oacc_bcast_size);
}
if (cfun->machine->sync_bar)
fprintf (file, "\t\tadd.u32\t\t%%r%d, %%tidy, 1; "
...
The assert is not triggered when the fallback is used.
I've tracked the problem down to:
...
- if (oacc_bcast_size < data.offset)
- oacc_bcast_size = data.offset;
+ if (oacc_bcast_partition < data.offset)
+ {
+ int psize = data.offset;
+ psize = (psize + oacc_bcast_align - 1) & ~(oacc_bcast_align - 1);
+ oacc_bcast_partition = psize;
+ oacc_bcast_size = psize * (nvptx_mach_max_workers () + 1);
+ }
...
We hit this if clause for a first compiled function, with num_workers(1).
This sets oacc_bcast_partition and oacc_bcast_size as required for that
functions.
Then we hit this if clause for a second compiled function, with
num_workers (2).
We need oacc_bcast_size updated, but the 'oacc_bcast_partition <
data.offset' is false, so the update doesn't happen.
I managed to fix this by making the code unconditional, and using MAX to
update oacc_bcast_partition and oacc_bcast_size.
Thanks,
- Tom