On 11/11/15 11:58, Richard Biener wrote:
On Mon, 9 Nov 2015, Tom de Vries wrote:
On 09/11/15 16:35, Tom de Vries wrote:
Hi,
this patch series for stage1 trunk adds support to:
- parallelize oacc kernels regions using parloops, and
- map the loops onto the oacc gang dimension.
The patch series contains these patches:
1 Insert new exit block only when needed in transform_to_exit_first_loop_alt
2 Make create_parallel_loop return void
3 Ignore reduction clause on kernels directive
4 Implement -foffload-alias
5 Add in_oacc_kernels_region in struct loop
6 Add pass_oacc_kernels
7 Add pass_dominator_oacc_kernels
8 Add pass_ch_oacc_kernels
9 Add pass_parallelize_loops_oacc_kernels
10 Add pass_oacc_kernels pass group in passes.def
11 Update testcases after adding kernels pass group
12 Handle acc loop directive
13 Add c-c++-common/goacc/kernels-*.c
14 Add gfortran.dg/goacc/kernels-*.f95
15 Add libgomp.oacc-c-c++-common/kernels-*.c
16 Add libgomp.oacc-fortran/kernels-*.f95
The first 9 patches are more or less independent, but patches 10-16 are
intended to be committed at the same time.
Bootstrapped and reg-tested on x86_64.
Built and reg-tested with an nvidia accelerator, in combination with a
patch that enables accelerator testing (which is submitted at
https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
I'll post the individual patches in reply to this message.
this patch adds a pass group pass_oacc_kernels (which will be added to the
pass list as a whole in patch 10).
Just to understand (while also skimming the HSA patches).
You are basically relying on autopar for what the HSA patches call
"gridification"? That is, OMP lowering produces loopy kernels
and autopar then will basically strip the outermost loop?
Short answer: no. In more detail...
Existing openmp support maps explicitly independent loops (annotated
with omp-for) in omp-parallel regions onto pthreads. It generates
thread functions containing sequential loops that iterate on a subset
of the data of the original loop.
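As a hand-written sketch (not actual GCC output; the lowering really
works on GIMPLE and uses the GOMP runtime, and struct omp_data and the
chunking scheme below are illustrative), a loop like

  #pragma omp parallel for
  for (i = 0; i < n; i++)
    a[i] = b[i] + c[i];

becomes a thread function in which each thread runs a sequential loop
over its own chunk of the iteration space:

  #include <omp.h>

  /* Illustrative stand-in for the record of base pointers passed
     into the region.  */
  struct omp_data { int n; float *a, *b, *c; };

  void
  thread_fn (struct omp_data *d)
  {
    int tid = omp_get_thread_num ();
    int nthreads = omp_get_num_threads ();
    int chunk = (d->n + nthreads - 1) / nthreads;
    int start = tid * chunk;
    int end = start + chunk < d->n ? start + chunk : d->n;
    for (int i = start; i < end; i++)
      d->a[i] = d->b[i] + d->c[i];
  }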
Parloops maps sequential loops onto pthreads (see the sketch after
this list) by:
- proving the loop is independent
- identifying reductions
- rewriting the loop into an omp-for annotated loop
- wrapping the loop in an omp-parallel region
- rewriting the variable accesses in the loop such that they are
relative to base pointers passed into the region
(note: this bit is done by omplower for omp-for loops from source)
- rewriting the preloop-read and postloop-write pair of a reduction
variable into an atomic update
- letting a subsequent ompexpand expand the omp-for and omp-parallel
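In source-level terms (a rough sketch only, since parloops works on
GIMPLE; 'data' again stands for the illustrative base-pointer struct
passed into the region), a reduction loop like

  sum = 0.0;
  for (i = 0; i < n; i++)
    sum += a[i];

conceptually becomes

  #pragma omp parallel
  {
    double local_sum = 0.0;
    #pragma omp for nowait
    for (int i = 0; i < n; i++)
      local_sum += data->a[i];  /* access rewritten via base pointer */
    #pragma omp atomic          /* preloop-read/postloop-write pair  */
    *data->sum += local_sum;    /* rewritten into an atomic update   */
  }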
The HSA support maps explicitly independent loops in openmp target
regions onto a shared-memory accelerator. By default, it generates
kernel functions containing sequential loops that iterate on a subset
of the data of the original loop. The control flow has a performance
penalty on
the accelerator, so there's a concept called gridification (explained
here: https://gcc.gnu.org/ml/gcc-patches/2015-11/msg00586.html ). [ I'm
not sure if it is an additional transformation or a different style of
generation ]. Gridification increases the launch dimensions of the
kernels to the point that there's only one iteration left in the loop,
which means that the control flow can be eliminated.
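As a rough illustration (hand-written, not the actual HSA codegen;
workitem_id is a made-up placeholder for the work-item id intrinsic):

  /* Without gridification: each work item loops over a chunk.  */
  for (int i = start; i < end; i++)
    a[i] = b[i] + c[i];

  /* With gridification: the grid is launched with one work item per
     iteration, so the loop control flow disappears.  */
  int i = workitem_id ();
  if (i < n)
    a[i] = b[i] + c[i];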
The openacc kernels support maps loops in an oacc kernels region onto a
non-shared memory accelerator. These loops can be unannotated loops or
acc-loop annotated loops. If an acc-loop directive contains the
independent clause, the loop is explicitly independent.
The current oacc kernels implementation mostly ignores the acc-loop
directive, in order to unify handling of annotated and unannotated
loops. The patch "Handle acc loop directive" (at
https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01089.html ) expands the
annotated loop as a sequential loop.
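For reference, the two flavours look like this at the source level
(hand-written example):

  #pragma acc kernels
  {
    /* Unannotated loop: parallelizable only if the compiler can
       prove it independent.  */
    for (int i = 0; i < n; i++)
      a[i] = b[i] + c[i];

    /* Annotated loop: with an 'independent' clause it would be
       explicitly independent.  */
    #pragma acc loop
    for (int i = 0; i < n; i++)
      b[i] = a[i] * 2.0f;
  }

With the patch above, both forms are expanded as sequential loops, so
the rest of the pass group can treat them uniformly.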
At the point that we get to pass_parallelize_loops_oacc_kernels, we have
sequential loops in an offloaded function (atm, there's no support for
the independent clause yet).
So pass_parallelize_loops_oacc_kernels transforms sequential loops in an
offloaded function originating from a kernels region into explicitly
independent loops (see the sketch after this list) by:
- proving the loop is independent
- identifying reductions
- rewriting the loop into an acc-loop annotated loop
- annotating the offloaded function with kernel launch dimensions
- rewriting the preloop-load and postloop-store pair of a reduction
variable into an atomic update
- letting a subsequent ompexpand expand the acc-loop
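In source-level terms (again only a sketch, since the pass works on
the GIMPLE of the offloaded function; 'data' is the illustrative
base-pointer struct from before), the reduction loop above roughly
becomes

  double local_sum = 0.0;        /* conceptual per-gang partial sum   */
  #pragma acc loop gang          /* proved independent, now annotated */
  for (int i = 0; i < n; i++)
    local_sum += data->a[i];
  #pragma acc atomic update      /* reduction epilogue rewritten as   */
  *data->sum += local_sum;       /* an atomic update                  */

with the containing offloaded function annotated with the kernel
launch dimensions (here the gang dimension).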
I'd say there is no explicit gridification in there.
AFAIU, gridification is something that can result from determining
the launch dimensions of the offloaded function and optimizing for
those dimensions. Currently pass_parallelize_loops_oacc_kernels is a
place where we set launch dimensions, but we're not optimizing for
them there; that happens later on. (And I'm starting to wonder whether
I can get rid of the setting of the gang dimension in
pass_parallelize_loops_oacc_kernels.)
Thanks,
- Tom