Hi,
in this email I'm trying to explain in detail what problem I'm running into with
reductions in oacc kernels region, and how I think it could be solved.
Any advice is welcome.
OVERALL PROBLEM
The overall problem I'm trying to solve is to implement the oacc kernels
directive in gcc, reusing pass_parallelize_loops.
OACC KERNELS
The oacc kernels region is a region with a series of loop nests, which are
intended to run on the accelerator. The compiler needs to offload each loop nest
to the accelerator, in the way most optimal for the accelerator.
PASS_PARALLELIZE_LOOPS
The pass analyzes loops. If the loop iterations are independent, and it looks
beneficial to parallelize the loop, the loop is transformed.
A copy of the loop is made, that deals with:
- small loop iterations for which the overhead of starting several threads will
be too big, or
- fixup loop iterations that are left in case the number of iterations is not
divisible by the parallelization factor.
The original loop is transformed:
- References of local variables are replaced with dereferences of a new
variable, which are initialized at loop entry with the addresses of the
original variables (eliminate_local_variables)
- copy loop-non-local variables to a structure, and replace references with
loads from a pointer to another (similar) structure
(seperate_decls_in_region)
- The loop is replaced with an GIMPLE_OMP_FOR (with and empty body) and
GIMPLE_OMP_CONTINUE
- The loop region is enveloped with GIMPLE_OMP_PARALLEL and GIMPLE_OMP_RETURN
- the loop region is omp-expanded using omp_expand_local
STATUS
I've created an initial implementation in vries/oacc-kernels, on top of the
gomp-4_0-branch.
GOMP-4_0-BRANCH
In the gomp-4_0-branch, the kernels directive is translated as a copy of the
oacc parallels directive. So, the following stages are done:
- pass_lower_omp/scan_omp:
- scan directive body for variables.
- build up omp_context datastructures.
- declare struct with fields corresponding to scanned variables.
- declare function with pointer to struct
- pass_lower_omp/lower_omp:
- declare struct
- assign values to struct fields
- declare pointer to struct
- rewrite body in terms of struct fields using pointer to struct.
- omp_expand:
- build up omp_region data-structures
- split off region in separate function
- replace region with call to oacc runtime function while passing function
pointer to split off function
VRIES/OACC-KERNELS
The current mechanism of offloading (compiling a function for a different
architecture) is using the lto-streaming. The parloops pass is located after the
lto-streaming point which is too late. OTOH, the parloops pass needs alias info,
which is only available after pass_build_ealias. So a copy of the parloops pass
specialized for oacc kernels has been added after pass_build_ealias (plus a
couple of passes to compensate for moving the pass up in the pass list).
The new pass does not use the lowering (first 2 steps of loop transform) of
parloops. The lowering is already done by pass_omp_lower.
The omp-expansion of the oacc-kernels region (done in gomp-4_0-branch) is
skipped, to allow first the alias analysis to work on the scope of the intact
function, and the new pass to do the omp-expansion.
So, the new pass:
- analyses the loop for dependences
- if independent, transforms the loop:
- The loop is replaced with an GIMPLE_OMP_FOR (kind_oacc_loop, with an empty
body) and GIMPLE_OMP_CONTINUE
- The GIMPLE_OACC_KERNELS is replaced with GIMPLE_OACC_PARALLEL
- the loop region is omp-expanded using omp_expand_local
The gotchas of the implementation are:
- no support for reductions, nested loops, more than one loop nest in
kernels region
- the fixup/low-it-count loop copy is still generated _inside_ the split off
function
PROBLEM WITH REDUCTIONS
In the vries/oacc-kernels implementation, the lowering of oacc kernels (in
pass_lower_omp) is done before any loop analysis. For reductions, that's not
possible anymore, since that would mean that detection of reductions comes after
handling of reductions.
The problem we're running into here, is that:
- on one hand, the oacc lowering is done on high gimple (scopes still intact
because GIMPLE_BINDs are still present, no bbs and cfgs, eh not expanded, no
ssa),
- otoh, loop analysis is done on low ssa gimple (bbs, cfgs, ssa, no scopes, eh
expanded)
The parloops pass is confronted with a similar problem.
AFAIU, ideal pass reuse for parloops would go something like this: on ssa, you
do loop analysis. You then insert omp pragmas that indicate what transformations
you want. Then you go back from ssa gimple to high gimple representation, and
you run omp-lower and omp-expand to do the actual transformations.
Things have been solved like this in parloops: the lowering of omp-lower is not
reused in parloops, but instead a different (but similar) lowering has been
added. What is reused, is the omp-expand. We don't go back to pre-ssa, but the
omp-expand code has been adapted to handle ssa code. And the parloops pass
removes the loop cfg part and substitutes it for a GIMPLE_OMP_FOR, as would be
the case for omp for directives present in a source with omp directives.
An advantage that parloops has, is that its scope is limited to a single loop.
OTOH, the implementation for oacc kernels has to cooperate with other oacc
constructs, f.i. if an array is already present on the device due to an earlier
construct.
Furthermore, the lowering styles are different.
In the omp-lower pass, oacc lowering is done like this: we load from a struct
field a pointer (D.2158), from which we load the thread-local pointer
instantiation (c.3D.2134):
...
D.2158 = .omp_data_iD.2148->cD.2150;
c.3D.2134 = *D.2158;
D.2137 = c.3D.2134 + D.2136;
*D.2137 = D.2144;
...
In the parloops pass, we simply load the thread-local pointer instantiation from
a struct field:
...
c.4_53 = .paral_data_load.9_56->c;
...
_7 = c.4_53 + _5;
*_7 = _14;
...
POSSIBLE SOLUTIONS
I can think of these possible solutions:
1. rewrite reduction analysis from parloops to work on high gimple.
2. rewrite omp-lowering to work on ssa gimple
3. try to reuse parloops lowering, and patch it up such that it coorporates with
code generated for other oacc constructs.
At the moment, I'm looking at the last option.
By lowering the kernels directive and its clauses at omp-low, but the associated
body at parloops, I get the following code (after parloops lowering) for a
vector add loop:
...
<bb 8>:
.omp_data_arr.10.c = &c;
.omp_data_arr.10.b = &b;
.omp_data_arr.10.a = &a;
#pragma acc kernels map(from:c [len: 8]) map(to:b [len: 8]) map(to:a [len:
8]) [child fn: main._omp_fn.0 (.omp_data_arr.10, .omp_data_sizes.11,
.omp_data_kinds.12)]
<bb 9>:
.omp_data_i_38 = &.omp_data_arr.10;
c.5_40 = c;
a.6_44 = a;
b.7_49 = b;
...
I think that if I manage to replace the last 3 with this I could have the vector
add loop working:
...
c_p = .omp_data_i_38->c;
a_p = .omp_data_i_38->a;
b_p = .omp_data_i_38->b;
c.5_40 = *c_p;
a.6_44 = *a_p;
b.7_49 = *b_p;
...
From there on, I'll try to get reductions working in a similar fashion.
Thanks,
- Tom