On 11/11/15 11:58, Richard Biener wrote:
On Mon, 9 Nov 2015, Tom de Vries wrote:

On 09/11/15 16:35, Tom de Vries wrote:
Hi,

this patch series for stage1 trunk adds support to:
- parallelize oacc kernels regions using parloops, and
- map the loops onto the oacc gang dimension.

The patch series contains these patches:

       1    Insert new exit block only when needed in
          transform_to_exit_first_loop_alt
       2    Make create_parallel_loop return void
       3    Ignore reduction clause on kernels directive
       4    Implement -foffload-alias
       5    Add in_oacc_kernels_region in struct loop
       6    Add pass_oacc_kernels
       7    Add pass_dominator_oacc_kernels
       8    Add pass_ch_oacc_kernels
       9    Add pass_parallelize_loops_oacc_kernels
      10    Add pass_oacc_kernels pass group in passes.def
      11    Update testcases after adding kernels pass group
      12    Handle acc loop directive
      13    Add c-c++-common/goacc/kernels-*.c
      14    Add gfortran.dg/goacc/kernels-*.f95
      15    Add libgomp.oacc-c-c++-common/kernels-*.c
      16    Add libgomp.oacc-fortran/kernels-*.f95

The first 9 patches are more or less independent, but patches 10-16 are
intended to be committed at the same time.

Bootstrapped and reg-tested on x86_64.

Built and reg-tested with an nvidia accelerator, in combination with a
patch that enables accelerator testing (submitted at
https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).

I'll post the individual patches in reply to this message.

this patch adds a pass group pass_oacc_kernels (which will be added to the
pass list as a whole in patch 10).

Just to understand (while also skimming the HSA patches).

You are basically relying on autopar for what the HSA patches call
"gridification"?  That is, OMP lowering produces loopy kernels
and autopar then will basically strip the outermost loop?

Short answer: no. In more detail...

Existing openmp support maps explicitly independent loops (annotated with omp-for) in omp-parallel regions onto pthreads. It generates thread functions containing sequential loops that each iterate over a subset of the data of the original loop.

Parloops maps sequential loops onto pthreads by:
- proving the loop is independent
- identifying reductions
- rewriting the loop into an omp-for annotated loop
- wrapping the loop in an omp-parallel region
- rewriting the variable accesses in the loop such that they are
  relative to base pointers passed into the region
  (note: this bit is done by omplower for omp-for loops from source)
- rewriting the preloop-read and postloop-write pair of a reduction
  variable into an atomic update
- letting a subsequent ompexpand expand the omp-for and omp-parallel

The HSA support maps explicitly independent loops in openmp target regions onto a shared-memory accelerator. By default, it generates kernel functions containing sequential loops that iterate over a subset of the data of the original loop. The control flow carries a performance penalty on the accelerator, so there's a concept called gridification (explained here: https://gcc.gnu.org/ml/gcc-patches/2015-11/msg00586.html ). [ I'm not sure if it is an additional transformation or a different style of generation. ] Gridification increases the launch dimensions of the kernel to the point that only one iteration is left in the loop, which means the control flow can be eliminated.

The openacc kernels support maps loops in an oacc kernels region onto a non-shared-memory accelerator. These loops can be unannotated loops, or acc-loop annotated loops. If an acc loop directive contains the independent clause, the loop is explicitly independent.

The current oacc kernels implementation mostly ignores the acc loop directive, in order to unify handling of annotated and unannotated loops. The patch "Handle acc loop directive" (at https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01089.html ) expands the annotated loop as a sequential loop. By the time we get to pass_parallelize_loops_oacc_kernels, we have sequential loops in an offloaded function (atm, there's no support for the independent clause yet).

So pass_parallelize_loops_oacc_kernels transforms sequential loops in an offloaded function originating from a kernels region into explicitly independent loops by:
- proving the loop is independent
- identifying reductions
- rewriting the loop into an acc-loop annotated loop
- annotating the offloaded function with kernel launch dimensions
- rewriting the preloop-load and postloop-store pair of a reduction
  variable into an atomic update
- letting a subsequent ompexpand expand the acc-loop

I'd say there is no explicit gridification in there.

AFAIU, gridification is something that can result from determining the launch dimensions of the offloaded function and optimizing for those dimensions. Currently pass_parallelize_loops_oacc_kernels is the place where we set launch dimensions, but we're not optimizing for them there; that happens later on. (And I'm starting to wonder whether I can get rid of setting the gang dimension in pass_parallelize_loops_oacc_kernels.)

Thanks,
- Tom
