On Tue, 14 Oct 2014, Tom de Vries wrote:

> Hi,
> 
> in this email I'm trying to explain in detail what problem I'm running into
> with reductions in oacc kernels region, and how I think it could be solved.
> 
> Any advice is welcome.
> 
> 
> OVERALL PROBLEM
> 
> The overall problem I'm trying to solve is to implement the oacc kernels
> directive in gcc, reusing pass_parallelize_loops.
> 
> 
> OACC KERNELS
> 
> The oacc kernels region is a region containing a series of loop nests, which
> are intended to run on the accelerator. The compiler needs to offload each
> loop nest to the accelerator in the way that is most optimal for that
> accelerator.
> 
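> For illustration, such a region with two loop nests might look like this
> (hand-written example, not one of the actual testcases):
> ...
>   void
>   example (int n, float *a, float *b, float *c, float *d)
>   {
>   #pragma acc kernels copyin(a[0:n], b[0:n]) copyout(c[0:n], d[0:n])
>     {
>       /* Two independent loop nests in one kernels region; the compiler
>          should offload and parallelize each of them.  */
>       for (int i = 0; i < n; i++)
>         c[i] = a[i] + b[i];
> 
>       for (int i = 0; i < n; i++)
>         d[i] = 2.0f * a[i] - b[i];
>     }
>   }
> ...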
> 
> PASS_PARALLELIZE_LOOPS
> 
> The pass analyzes loops. If the loop iterations are independent, and it looks
> beneficial to parallelize the loop, the loop is transformed.
> 
> A copy of the loop is made that deals with:
> - small iteration counts, for which the overhead of starting several threads
>   would be too big, or
> - fixup iterations that are left over in case the number of iterations is not
>   divisible by the parallelization factor.
> 
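> For a simple vector-add loop, this check-and-copy has roughly the following
> shape (hand-written sketch, not actual pass output; MIN_ITERS, NTHREADS and
> run_parallel_loop are placeholder names):
> ...
>   if (n < MIN_ITERS)
>     {
>       /* Serial copy: too few iterations to pay the thread-start cost.  */
>       for (i = 0; i < n; i++)
>         c[i] = a[i] + b[i];
>     }
>   else
>     {
>       /* The parallel part covers the iterations that divide evenly over
>          the threads.  */
>       int chunk = n / NTHREADS * NTHREADS;
>       run_parallel_loop (0, chunk);
>       /* Fixup loop for the remaining iterations.  */
>       for (i = chunk; i < n; i++)
>         c[i] = a[i] + b[i];
>     }
> ...
> 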
> The original loop is transformed:
> - References to local variables are replaced with dereferences of new
>   variables, which are initialized at loop entry with the addresses of the
>   original variables (eliminate_local_variables)
> - Loop-non-local variables are copied to a structure, and references are
>   replaced with loads from a pointer to another (similar) structure
>   (separate_decls_in_region)
> - The loop is replaced with a GIMPLE_OMP_FOR (with an empty body) and
>   GIMPLE_OMP_CONTINUE
> - The loop region is enveloped with GIMPLE_OMP_PARALLEL and GIMPLE_OMP_RETURN
> - The loop region is omp-expanded using omp_expand_local
> 
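> In source-level terms, the separate_decls_in_region step amounts roughly to
> this (hand-written analogy in plain C, not actual pass output; the real dumps
> use names like .paral_data_load, as shown further below):
> ...
>   /* At the original site: pack loop-non-local variables into a struct.  */
>   struct paral_data { float *a; float *b; float *c; int n; };
>   struct paral_data paral_data_store;
>   paral_data_store.a = a;
>   paral_data_store.b = b;
>   paral_data_store.c = c;
>   paral_data_store.n = n;
> 
>   /* In the (future) split-off worker: read them back through a pointer to
>      a similar struct.  */
>   struct paral_data *paral_data_load = &paral_data_store;
>   float *a_ = paral_data_load->a;
>   float *b_ = paral_data_load->b;
>   float *c_ = paral_data_load->c;
>   int n_ = paral_data_load->n;
> ...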
> 
> STATUS
> 
> I've created an initial implementation in vries/oacc-kernels, on top of the
> gomp-4_0-branch.
> 
> 
> GOMP-4_0-BRANCH
> 
> In the gomp-4_0-branch, the kernels directive is translated as a copy of the
> oacc parallel directive.  So, the following stages are done:
> - pass_lower_omp/scan_omp:
>   - scan directive body for variables.
>   - build up omp_context datastructures.
>   - declare struct with fields corresponding to scanned variables.
>   - declare function with pointer to struct
> - pass_lower_omp/lower_omp:
>   - declare struct
>   - assign values to struct fields
>   - declare pointer to struct
>   - rewrite body in terms of struct fields using pointer to struct.
> - omp_expand:
>   - build up omp_region data-structures
>   - split off region in separate function
>   - replace region with call to oacc runtime function while passing function
>     pointer to split off function
> 
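> Put together, the effect of these stages is roughly the following
> (hand-written sketch in C terms, not actual pass output):
> ...
>   /* Lowering: variables used in the region are collected into a struct;
>      the region body is rewritten to access them through a pointer to it.  */
>   struct omp_data_s { float **c; float **b; float **a; };
>   struct omp_data_s omp_data_arr = { &c, &b, &a };
> 
>   /* Expansion: the region body is split off into a child function
>      (main._omp_fn.0 in the dump further below), and the region itself is
>      replaced by a call into the oacc runtime that gets the child function
>      pointer and &omp_data_arr as arguments.  */
> ...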
> 
> VRIES/OACC-KERNELS
> 
> The current mechanism of offloading (compiling a function for a different
> architecture) uses lto-streaming. The parloops pass is located after the
> lto-streaming point, which is too late. OTOH, the parloops pass needs alias
> info, which is only available after pass_build_ealias. So a copy of the
> parloops pass specialized for oacc kernels has been added after
> pass_build_ealias (plus a couple of passes to compensate for moving the pass
> up in the pass list).
> 
> The new pass does not use the lowering (the first 2 steps of the loop
> transform) of parloops. The lowering is already done by pass_lower_omp.
> 
> The omp-expansion of the oacc-kernels region (done in gomp-4_0-branch) is
> skipped, to first allow the alias analysis to work on the scope of the intact
> function, and to let the new pass do the omp-expansion.
> 
> So, the new pass:
> - analyses the loop for dependences
> - if independent, transforms the loop:
>   - The loop is replaced with a GIMPLE_OMP_FOR (kind_oacc_loop, with an empty
>     body) and GIMPLE_OMP_CONTINUE
>   - The GIMPLE_OACC_KERNELS is replaced with GIMPLE_OACC_PARALLEL
>   - the loop region is omp-expanded using omp_expand_local
> 
> The gotchas of the implementation are:
> - no support for reductions, nested loops, or more than one loop nest in the
>   kernels region
> - the fixup/low-iteration-count loop copy is still generated _inside_ the
>   split-off function
> 
> 
> PROBLEM WITH REDUCTIONS
> 
> In the vries/oacc-kernels implementation, the lowering of oacc kernels (in
> pass_lower_omp) is done before any loop analysis. For reductions, that's no
> longer possible, since it would mean that the detection of reductions comes
> after the handling of reductions.
> 
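> To make the problem concrete, a typical reduction in a kernels region looks
> like this (illustrative example, not one of the actual testcases):
> ...
>   float
>   sum_array (int n, float *a)
>   {
>     float sum = 0.0f;
>   #pragma acc kernels copyin(a[0:n])
>     for (int i = 0; i < n; i++)
>       sum += a[i];
>     return sum;
>   }
> ...
> To parallelize this, the pass first has to detect that sum is a reduction,
> and only then can it generate the per-thread partial sums and the final
> combining step - i.e. detection has to precede handling.
> 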
> The problem we're running into here is that:
> - on the one hand, the oacc lowering is done on high gimple (scopes still
>   intact because GIMPLE_BINDs are still present, no bbs or cfg, eh not
>   expanded, no ssa),
> - otoh, loop analysis is done on low ssa gimple (bbs, cfg, ssa, no scopes, eh
>   expanded)
> 
> The parloops pass is confronted with a similar problem.
> 
> AFAIU, ideal pass reuse for parloops would go something like this: on ssa, you
> do loop analysis. You then insert omp pragmas that indicate what
> transformations you want. Then you go back from ssa gimple to high gimple
> representation, and you run omp-lower and omp-expand to do the actual
> transformations.
> 
> Things have been solved like this in parloops: the lowering of omp-lower is
> not reused in parloops, but instead a different (but similar) lowering has
> been added. What is reused is the omp-expand. We don't go back to pre-ssa,
> but the omp-expand code has been adapted to handle ssa code. And the parloops
> pass removes the loop cfg part and substitutes a GIMPLE_OMP_FOR for it, as
> would be the case for omp for directives present in a source with omp
> directives.
> 
> An advantage that parloops has is that its scope is limited to a single loop.
> OTOH, the implementation for oacc kernels has to cooperate with other oacc
> constructs, f.i. if an array is already present on the device due to an
> earlier construct.
> 
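> For instance, an enclosing data construct may already have copied an array to
> the device (illustrative example, not one of the actual testcases):
> ...
>   void
>   scale (int n, float *a, float *b)
>   {
>   #pragma acc data copyin(a[0:n]) copyout(b[0:n])
>     {
>     #pragma acc kernels
>       for (int i = 0; i < n; i++)
>         b[i] = 2.0f * a[i];  /* a and b are already present on the device */
>     }
>   }
> ...
> The kernels implementation then has to use the existing device copies rather
> than mapping the arrays again.
> 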
> Furthermore, the lowering styles are different.
> 
> In the omp-lower pass, oacc lowering is done like this: we load a pointer
> (D.2158) from a struct field, and from that we load the thread-local pointer
> instantiation (c.3D.2134):
> ...
>   D.2158 = .omp_data_iD.2148->cD.2150;
>   c.3D.2134 = *D.2158;
>   D.2137 = c.3D.2134 + D.2136;
>   *D.2137 = D.2144;
> ...
> 
> In the parloops pass, we simply load the thread-local pointer instantiation
> from a struct field:
> ...
>   c.4_53 = .paral_data_load.9_56->c;
> 
>   ...
> 
>   _7 = c.4_53 + _5;
>   *_7 = _14;
> ...
> 
> 
> POSSIBLE SOLUTIONS
> 
> I can think of these possible solutions:
> 
> 1. rewrite reduction analysis from parloops to work on high gimple.
> 
> 2. rewrite omp-lowering to work on ssa gimple.
> 
> 3. try to reuse the parloops lowering, and patch it up such that it cooperates
> with code generated for other oacc constructs.
> 
> 
> At the moment, I'm looking at the last option.
> 
> By lowering the kernels directive and its clauses at omp-low, but the
> associated body at parloops, I get the following code (after parloops
> lowering) for a vector add loop:
> ...
>   <bb 8>:
>   .omp_data_arr.10.c = &c;
>   .omp_data_arr.10.b = &b;
>   .omp_data_arr.10.a = &a;
>   #pragma acc kernels map(from:c [len: 8]) map(to:b [len: 8]) map(to:a [len: 8])
>     [child fn: main._omp_fn.0 (.omp_data_arr.10, .omp_data_sizes.11,
>     .omp_data_kinds.12)]
> 
>   <bb 9>:
>   .omp_data_i_38 = &.omp_data_arr.10;
>   c.5_40 = c;
>   a.6_44 = a;
>   b.7_49 = b;
> ...
> 
> I think that if I manage to replace the last 3 assignments with this, I could
> have the vector add loop working:
> ...
>   c_p = .omp_data_i_38->c;
>   a_p = .omp_data_i_38->a;
>   b_p = .omp_data_i_38->b;
>   c.5_40 = *c_p;
>   a.6_44 = *a_p;
>   b.7_49 = *b_p;
> ...
> 
> From there on, I'll try to get reductions working in a similar fashion.

Not really understanding how 3) can replace 1) or 2) ... but what
I understand is that OACC lowering happens at pass_lower_omp
(no CFG or loops or SSA).  If we want to keep that, the "proper"
choice for dealing with this high-level OACC "kernels" directive
is to pass it down somehow and deal with it later when loops + SSA
are available.  After all, you _are_ re-using parloops for the
"lowering".

So - can't OACC kernels lowering at lower_omp time simply annotate loops?
Like the way #pragma ivdep is handled?  Maybe this should even happen
from inside the parser (which knows where loops are)?  Then at CFG / loop
build time this information is transferred to loop meta-data (same
as with IVDEP) and a parloops pass somewhere in early opts can do
the right thing(TM) on the marked loops?
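
For reference, the user-level form of that annotation looks like
(hand-written example):

  void
  add (float *a, float *b, float *c, int n)
  {
  #pragma GCC ivdep
    for (int i = 0; i < n; i++)
      c[i] = a[i] + b[i];
  }

Something similar, generated internally for the loops in a kernels region,
could mark them for the parloops-like pass to pick up later.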

Thanks,
Richard.

> Thanks,
> - Tom
> 
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE / SUSE Labs
SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer
