On 03/30/2018 03:07 AM, Tom de Vries wrote:
On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
As a follow up patch will show, the nvptx BE falls back to using
vector_length = 32 when a vector loop is nested inside a worker loop.

I disabled the fallback, and analyzed the vred2d-128.c illegal memory access execution failure.

I minimized that down to this ptx:
...
.shared .align 8 .u8 __oacc_bcast[176];

{
   {
     .reg .u32 %x;
     mov.u32 %x,%tid.x;
     setp.ne.u32 %r86,%x,0;
   }

   {
     .reg .u32 %tidy;
     .reg .u64 %t_bcast;
     .reg .u64 %y64;
     mov.u32 %tidy,%tid.y;
     cvt.u64.u32 %y64,%tidy;
     add.u64 %y64,%y64,1;
     cvta.shared.u64 %t_bcast,__oacc_bcast;
     mad.lo.u64 %r66,%y64,88,%t_bcast;
   }

   @ %r86 bra $L28;
   st.u32 [%r66+80],0;
  $L28:
   ret;
}
...

The ptx is called with 2 workers and 128 vector_length.

So, 2 workers mean %tid.y has values 0 and 1.
Then %y64 has values 1 and 2.
Then %r66 has values __oacc_bcast + (1 * 88) and __oacc_bcast + (2 * 88).
Then the st.u32 accesss __oacc_bcast + (1 * 88) + 80 and __oacc_bcast + (2 * 88) + 80.

So we're accessing memory at location 256, while the __oacc_bcast is only 176 bytes big.

I formulated this assert that AFAIU detects this situation in the compiler:
...
@@ -1125,6 +1125,8 @@ nvptx_init_axis_predicate (FILE *file, int regno, const char *name)
    fprintf (file, "\t}\n");
  }

+static int nvptx_mach_max_workers ();
+
  /* Emit code to initialize OpenACC worker broadcast and synchronization
     registers.  */

@@ -1148,6 +1150,7 @@ nvptx_init_oacc_workers (FILE *file)
                "// vector broadcast offset\n",
                REGNO (cfun->machine->bcast_partition),
                oacc_bcast_partition);
+      gcc_assert (oacc_bcast_partition * (nvptx_mach_max_workers () + 1) <= oacc_bcast_size);
      }
    if (cfun->machine->sync_bar)
      fprintf (file, "\t\tadd.u32\t\t%%r%d, %%tidy, 1; "
...

The assert is not triggered when the fallback is used.

I've tracked the problem down to:
...
- if (oacc_bcast_size < data.offset) - oacc_bcast_size = data.offset; + if (oacc_bcast_partition < data.offset) + { + int psize = data.offset; + psize = (psize + oacc_bcast_align - 1) & ~(oacc_bcast_align - 1); + oacc_bcast_partition = psize; + oacc_bcast_size = psize * (nvptx_mach_max_workers () + 1); + }
...

We hit this if clause for a first compiled function, with num_workers(1).

This sets oacc_bcast_partition and oacc_bcast_size as required for that functions.

Then we hit this if clause for a second compiled function, with num_workers (2).

We need oacc_bcast_size updated, but the 'oacc_bcast_partition < data.offset' is false, so the update doesn't happen.

I managed to fix this by making the code unconditional, and using MAX to update oacc_bcast_partition and oacc_bcast_size.

Thanks,
- Tom

Reply via email to