On Tue, 14 Oct 2014, Tom de Vries wrote: > Hi, > > in this email I'm trying to explain in detail what problem I'm running into > with reductions in oacc kernels region, and how I think it could be solved. > > Any advice is welcome. > > > OVERALL PROBLEM > > The overall problem I'm trying to solve is to implement the oacc kernels > directive in gcc, reusing pass_parallelize_loops. > > > OACC KERNELS > > The oacc kernels region is a region with a series of loop nests, which are > intended to run on the accelerator. The compiler needs to offload each loop > nest to the accelerator, in the way most optimal for the accelerator. > > > PASS_PARALLELIZE_LOOPS > > The pass analyzes loops. If the loop iterations are independent, and it looks > beneficial to parallelize the loop, the loop is transformed. > > A copy of the loop is made, that deals with: > - small loop iterations for which the overhead of starting several threads > will > be too big, or > - fixup loop iterations that are left in case the number of iterations is not > divisible by the parallelization factor. > > The original loop is transformed: > - References of local variables are replaced with dereferences of a new > variable, which are initialized at loop entry with the addresses of the > original variables (eliminate_local_variables) > - copy loop-non-local variables to a structure, and replace references with > loads from a pointer to another (similar) structure > (seperate_decls_in_region) > - The loop is replaced with an GIMPLE_OMP_FOR (with and empty body) and > GIMPLE_OMP_CONTINUE > - The loop region is enveloped with GIMPLE_OMP_PARALLEL and GIMPLE_OMP_RETURN > - the loop region is omp-expanded using omp_expand_local > > > STATUS > > I've created an initial implementation in vries/oacc-kernels, on top of the > gomp-4_0-branch. > > > GOMP-4_0-BRANCH > > In the gomp-4_0-branch, the kernels directive is translated as a copy of the > oacc parallels directive. So, the following stages are done: > - pass_lower_omp/scan_omp: > - scan directive body for variables. > - build up omp_context datastructures. > - declare struct with fields corresponding to scanned variables. > - declare function with pointer to struct > - pass_lower_omp/lower_omp: > - declare struct > - assign values to struct fields > - declare pointer to struct > - rewrite body in terms of struct fields using pointer to struct. > - omp_expand: > - build up omp_region data-structures > - split off region in separate function > - replace region with call to oacc runtime function while passing function > pointer to split off function > > > VRIES/OACC-KERNELS > > The current mechanism of offloading (compiling a function for a different > architecture) is using the lto-streaming. The parloops pass is located after > the lto-streaming point which is too late. OTOH, the parloops pass needs alias > info, which is only available after pass_build_ealias. So a copy of the > parloops pass specialized for oacc kernels has been added after > pass_build_ealias (plus a couple of passes to compensate for moving the pass > up in the pass list). > > The new pass does not use the lowering (first 2 steps of loop transform) of > parloops. The lowering is already done by pass_omp_lower. > > The omp-expansion of the oacc-kernels region (done in gomp-4_0-branch) is > skipped, to allow first the alias analysis to work on the scope of the intact > function, and the new pass to do the omp-expansion. > > So, the new pass: > - analyses the loop for dependences > - if independent, transforms the loop: > - The loop is replaced with an GIMPLE_OMP_FOR (kind_oacc_loop, with an empty > body) and GIMPLE_OMP_CONTINUE > - The GIMPLE_OACC_KERNELS is replaced with GIMPLE_OACC_PARALLEL > - the loop region is omp-expanded using omp_expand_local > > The gotchas of the implementation are: > - no support for reductions, nested loops, more than one loop nest in > kernels region > - the fixup/low-it-count loop copy is still generated _inside_ the split off > function > > > PROBLEM WITH REDUCTIONS > > In the vries/oacc-kernels implementation, the lowering of oacc kernels (in > pass_lower_omp) is done before any loop analysis. For reductions, that's not > possible anymore, since that would mean that detection of reductions comes > after handling of reductions. > > The problem we're running into here, is that: > - on one hand, the oacc lowering is done on high gimple (scopes still intact > because GIMPLE_BINDs are still present, no bbs and cfgs, eh not expanded, no > ssa), > - otoh, loop analysis is done on low ssa gimple (bbs, cfgs, ssa, no scopes, eh > expanded) > > The parloops pass is confronted with a similar problem. > > AFAIU, ideal pass reuse for parloops would go something like this: on ssa, you > do loop analysis. You then insert omp pragmas that indicate what > transformations you want. Then you go back from ssa gimple to high gimple > representation, and you run omp-lower and omp-expand to do the actual > transformations. > > Things have been solved like this in parloops: the lowering of omp-lower is > not reused in parloops, but instead a different (but similar) lowering has > been added. What is reused, is the omp-expand. We don't go back to pre-ssa, > but the omp-expand code has been adapted to handle ssa code. And the parloops > pass removes the loop cfg part and substitutes it for a GIMPLE_OMP_FOR, as > would be the case for omp for directives present in a source with omp > directives. > > An advantage that parloops has, is that its scope is limited to a single loop. > OTOH, the implementation for oacc kernels has to cooperate with other oacc > constructs, f.i. if an array is already present on the device due to an > earlier construct. > > Furthermore, the lowering styles are different. > > In the omp-lower pass, oacc lowering is done like this: we load from a struct > field a pointer (D.2158), from which we load the thread-local pointer > instantiation (c.3D.2134): > ... > D.2158 = .omp_data_iD.2148->cD.2150; > c.3D.2134 = *D.2158; > D.2137 = c.3D.2134 + D.2136; > *D.2137 = D.2144; > ... > > In the parloops pass, we simply load the thread-local pointer instantiation > from a struct field: > ... > c.4_53 = .paral_data_load.9_56->c; > > ... > > _7 = c.4_53 + _5; > *_7 = _14; > ... > > > POSSIBLE SOLUTIONS > > I can think of these possible solutions: > > 1. rewrite reduction analysis from parloops to work on high gimple. > > 2. rewrite omp-lowering to work on ssa gimple > > 3. try to reuse parloops lowering, and patch it up such that it coorporates > with code generated for other oacc constructs. > > > At the moment, I'm looking at the last option. > > By lowering the kernels directive and its clauses at omp-low, but the > associated body at parloops, I get the following code (after parloops > lowering) for a vector add loop: > ... > <bb 8>: > .omp_data_arr.10.c = &c; > .omp_data_arr.10.b = &b; > .omp_data_arr.10.a = &a; > #pragma acc kernels map(from:c [len: 8]) map(to:b [len: 8]) map(to:a [len: > 8]) [child fn: main._omp_fn.0 (.omp_data_arr.10, .omp_data_sizes.11, > .omp_data_kinds.12)] > > <bb 9>: > .omp_data_i_38 = &.omp_data_arr.10; > c.5_40 = c; > a.6_44 = a; > b.7_49 = b; > ... > > I think that if I manage to replace the last 3 with this I could have the > vector add loop working: > ... > c_p = .omp_data_i_38->c; > a_p = .omp_data_i_38->a; > b_p = .omp_data_i_38->b; > c.5_40 = *c_p; > a.6_44 = *a_p; > b.7_49 = *b_p; > ... > > From there on, I'll try to get reductions working in a similar fashion.
Not really understanding how 3) can replace 1) or 2) ... but what I understand is that OACC lowering happens at pass_lower_omp (no CFG or loops or SSA). If we want to keep that the "proper" choice of dealing with this high-level OACC "kernel" directive is to pass it down somehow and deal with it later when loops + SSA are available. After all you _are_ re-using parloops for the "lowering". So - can't OACC kernel lowering at lower_omp time simply annotate loops? Like in a way #pragma ivdep is handled? Maybe this should even happen from inside the parser (who knows where loops are)? Then at CFG / loop build time this information is transfered to loop meta-data (same as with IVDEP) and a parloop pass somewhere in early opts can do the right thing(TM) on the marked loops? Thanks, Richard. > Thanks, > - Tom > > -- Richard Biener <rguent...@suse.de> SUSE / SUSE Labs SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746 GF: Jeff Hawn, Jennifer Guild, Felix Imend"orffer