oacc kernels directive -- reductions

Tom de Vries Tue, 14 Oct 2014 02:37:55 -0700

Hi,

in this email I'm trying to explain in detail what problem I'm running into withreductions in oacc kernels region, and how I think it could be solved.


Any advice is welcome.


OVERALL PROBLEM

The overall problem I'm trying to solve is to implement the oacc kernelsdirective in gcc, reusing pass_parallelize_loops.



OACC KERNELS

The oacc kernels region is a region with a series of loop nests, which areintended to run on the accelerator. The compiler needs to offload each loop nestto the accelerator, in the way most optimal for the accelerator.



PASS_PARALLELIZE_LOOPS

The pass analyzes loops. If the loop iterations are independent, and it looksbeneficial to parallelize the loop, the loop is transformed.


A copy of the loop is made, that deals with:
- small loop iterations for which the overhead of starting several threads will
  be too big, or
- fixup loop iterations that are left in case the number of iterations is not
  divisible by the parallelization factor.

The original loop is transformed:
- References of local variables are replaced with dereferences of a new
  variable, which are initialized at loop entry with the addresses of the
  original variables (eliminate_local_variables)
- copy loop-non-local variables to a structure, and replace references with
  loads from a pointer to another (similar) structure
  (seperate_decls_in_region)
- The loop is replaced with an GIMPLE_OMP_FOR (with and empty body) and
  GIMPLE_OMP_CONTINUE
- The loop region is enveloped with GIMPLE_OMP_PARALLEL and GIMPLE_OMP_RETURN
- the loop region is omp-expanded using omp_expand_local


STATUS

I've created an initial implementation in vries/oacc-kernels, on top of thegomp-4_0-branch.



GOMP-4_0-BRANCH

In the gomp-4_0-branch, the kernels directive is translated as a copy of theoacc parallels directive. So, the following stages are done:

- pass_lower_omp/scan_omp:
  - scan directive body for variables.
  - build up omp_context datastructures.
  - declare struct with fields corresponding to scanned variables.
  - declare function with pointer to struct
- pass_lower_omp/lower_omp:
  - declare struct
  - assign values to struct fields
  - declare pointer to struct
  - rewrite body in terms of struct fields using pointer to struct.
- omp_expand:
  - build up omp_region data-structures
  - split off region in separate function
  - replace region with call to oacc runtime function while passing function
    pointer to split off function


VRIES/OACC-KERNELS

The current mechanism of offloading (compiling a function for a differentarchitecture) is using the lto-streaming. The parloops pass is located after thelto-streaming point which is too late. OTOH, the parloops pass needs alias info,which is only available after pass_build_ealias. So a copy of the parloops passspecialized for oacc kernels has been added after pass_build_ealias (plus acouple of passes to compensate for moving the pass up in the pass list).

The new pass does not use the lowering (first 2 steps of loop transform) ofparloops. The lowering is already done by pass_omp_lower.

The omp-expansion of the oacc-kernels region (done in gomp-4_0-branch) isskipped, to allow first the alias analysis to work on the scope of the intactfunction, and the new pass to do the omp-expansion.


So, the new pass:
- analyses the loop for dependences
- if independent, transforms the loop:
  - The loop is replaced with an GIMPLE_OMP_FOR (kind_oacc_loop, with an empty
    body) and GIMPLE_OMP_CONTINUE
  - The GIMPLE_OACC_KERNELS is replaced with GIMPLE_OACC_PARALLEL
  - the loop region is omp-expanded using omp_expand_local

The gotchas of the implementation are:
- no support for reductions, nested loops, more than one loop nest in
  kernels region
- the fixup/low-it-count loop copy is still generated _inside_ the split off
  function


PROBLEM WITH REDUCTIONS

In the vries/oacc-kernels implementation, the lowering of oacc kernels (inpass_lower_omp) is done before any loop analysis. For reductions, that's notpossible anymore, since that would mean that detection of reductions comes afterhandling of reductions.


The problem we're running into here, is that:
- on one hand, the oacc lowering is done on high gimple (scopes still intact
  because GIMPLE_BINDs are still present, no bbs and cfgs, eh not expanded, no
  ssa),
- otoh, loop analysis is done on low ssa gimple (bbs, cfgs, ssa, no scopes, eh
  expanded)

The parloops pass is confronted with a similar problem.

AFAIU, ideal pass reuse for parloops would go something like this: on ssa, youdo loop analysis. You then insert omp pragmas that indicate what transformationsyou want. Then you go back from ssa gimple to high gimple representation, andyou run omp-lower and omp-expand to do the actual transformations.

Things have been solved like this in parloops: the lowering of omp-lower is notreused in parloops, but instead a different (but similar) lowering has beenadded. What is reused, is the omp-expand. We don't go back to pre-ssa, but theomp-expand code has been adapted to handle ssa code. And the parloops passremoves the loop cfg part and substitutes it for a GIMPLE_OMP_FOR, as would bethe case for omp for directives present in a source with omp directives.

An advantage that parloops has, is that its scope is limited to a single loop.OTOH, the implementation for oacc kernels has to cooperate with other oaccconstructs, f.i. if an array is already present on the device due to an earlierconstruct.


Furthermore, the lowering styles are different.

In the omp-lower pass, oacc lowering is done like this: we load from a structfield a pointer (D.2158), from which we load the thread-local pointerinstantiation (c.3D.2134):

...
  D.2158 = .omp_data_iD.2148->cD.2150;
  c.3D.2134 = *D.2158;
  D.2137 = c.3D.2134 + D.2136;
  *D.2137 = D.2144;
...

In the parloops pass, we simply load the thread-local pointer instantiation froma struct field:

...
  c.4_53 = .paral_data_load.9_56->c;

  ...

  _7 = c.4_53 + _5;
  *_7 = _14;
...


POSSIBLE SOLUTIONS

I can think of these possible solutions:

1. rewrite reduction analysis from parloops to work on high gimple.

2. rewrite omp-lowering to work on ssa gimple

3. try to reuse parloops lowering, and patch it up such that it coorporates withcode generated for other oacc constructs.



At the moment, I'm looking at the last option.

By lowering the kernels directive and its clauses at omp-low, but the associatedbody at parloops, I get the following code (after parloops lowering) for avector add loop:

...
  <bb 8>:
  .omp_data_arr.10.c = &c;
  .omp_data_arr.10.b = &b;
  .omp_data_arr.10.a = &a;

#pragma acc kernels map(from:c [len: 8]) map(to:b [len: 8]) map(to:a [len:8]) [child fn: main._omp_fn.0 (.omp_data_arr.10, .omp_data_sizes.11,.omp_data_kinds.12)]


  <bb 9>:
  .omp_data_i_38 = &.omp_data_arr.10;
  c.5_40 = c;
  a.6_44 = a;
  b.7_49 = b;
...

I think that if I manage to replace the last 3 with this I could have the vectoradd loop working:

...
  c_p = .omp_data_i_38->c;
  a_p = .omp_data_i_38->a;
  b_p = .omp_data_i_38->b;
  c.5_40 = *c_p;
  a.6_44 = *a_p;
  b.7_49 = *b_p;
...

From there on, I'll try to get reductions working in a similar fashion.

Thanks,
- Tom

oacc kernels directive -- reductions

Reply via email to