On 11/11/15 11:58, Richard Biener wrote:
On Mon, 9 Nov 2015, Tom de Vries wrote:
On 09/11/15 16:35, Tom de Vries wrote:
Hi,
this patch series for stage1 trunk adds support to:
- parallelize oacc kernels regions using parloops, and
- map the loops onto the oacc gang dimension.
The patch series contains these patches:
1 Insert new exit block only when needed in transform_to_exit_first_loop_alt
2 Make create_parallel_loop return void
3 Ignore reduction clause on kernels directive
4 Implement -foffload-alias
5 Add in_oacc_kernels_region in struct loop
6 Add pass_oacc_kernels
7 Add pass_dominator_oacc_kernels
8 Add pass_ch_oacc_kernels
9 Add pass_parallelize_loops_oacc_kernels
10 Add pass_oacc_kernels pass group in passes.def
11 Update testcases after adding kernels pass group
12 Handle acc loop directive
13 Add c-c++-common/goacc/kernels-*.c
14 Add gfortran.dg/goacc/kernels-*.f95
15 Add libgomp.oacc-c-c++-common/kernels-*.c
16 Add libgomp.oacc-fortran/kernels-*.f95
The first 9 patches are more or less independent, but patches 10-16 are
intended to be committed at the same time.
Bootstrapped and reg-tested on x86_64.
Built and reg-tested with an nvidia accelerator, in combination with a
patch that enables accelerator testing (which is submitted at
https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
I'll post the individual patches in reply to this message.
this patch adds a pass group pass_oacc_kernels (which will be added to the
pass list as a whole in patch 10).
Just to understand (while also skimming the HSA patches).
You are basically relying on autopar for what the HSA patches call
"gridification"? That is, OMP lowering produces loopy kernels
and autopar then will basically strip the outermost loop?
Short answer: no. In more detail...
Existing openmp support maps explicitly independent loops (annotated
with omp-for) in omp-parallel regions onto pthreads. It generates
thread functions containing sequential loops that iterate on a subset
of the data of the original loop.
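As a hand-written sketch (not actual GCC output; the lowering really
works on GIMPLE and uses the GOMP runtime, and struct omp_data and the
chunking scheme below are illustrative), a loop like

  #pragma omp parallel for
  for (i = 0; i < n; i++)
    a[i] = b[i] + c[i];

becomes a thread function in which each thread runs a sequential loop
over its own chunk of the iteration space:

  #include <omp.h>

  /* Illustrative stand-in for the record of base pointers passed
     into the region.  */
  struct omp_data { int n; float *a, *b, *c; };

  void
  thread_fn (struct omp_data *d)
  {
    int tid = omp_get_thread_num ();
    int nthreads = omp_get_num_threads ();
    int chunk = (d->n + nthreads - 1) / nthreads;
    int start = tid * chunk;
    int end = start + chunk < d->n ? start + chunk : d->n;
    for (int i = start; i < end; i++)
      d->a[i] = d->b[i] + d->c[i];
  }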
Parloops maps sequential loops onto pthreads (see the sketch after
this list) by:
- proving the loop is independent
- identifying reductions
- rewriting the loop into an omp-for annotated loop
- wrapping the loop in an omp-parallel region
- rewriting the variable accesses in the loop such that they are
relative to base pointers passed into the region
(note: this bit is done by omplower for omp-for loops from source)
- rewriting the preloop-read and postloop-write pair of a reduction
variable into an atomic update
- letting a subsequent ompexpand expand the omp-for and omp-parallel
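In source-level terms (a rough sketch only, since parloops works on
GIMPLE; 'data' again stands for the illustrative base-pointer struct
passed into the region), a reduction loop like

  sum = 0.0;
  for (i = 0; i < n; i++)
    sum += a[i];

conceptually becomes

  #pragma omp parallel
  {
    double local_sum = 0.0;
    #pragma omp for nowait
    for (int i = 0; i < n; i++)
      local_sum += data->a[i];  /* access rewritten via base pointer */
    #pragma omp atomic          /* preloop-read/postloop-write pair  */
    *data->sum += local_sum;    /* rewritten into an atomic update   */
  }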
The HSA support maps explicitly independent loops in openmp target
regions onto a shared-memory accelerator. By default, it generates
kernel functions containing sequential loops that iterate on a subset
of the data of the original loop. The control flow has a performance
penalty on
the accelerator, so there's a concept called gridification (explained
here: https://gcc.gnu.org/ml/gcc-patches/2015-11/msg00586.html ). [ I'm
not sure if it is an additional transformation or a different style of
generation ]. Gridification increases the launch dimensions of the
kernels to the point that there's only one iteration left in the loop,
which means that the control flow can be eliminated.
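As a rough illustration (hand-written, not the actual HSA codegen;
workitem_id is a made-up placeholder for the work-item id intrinsic):

  /* Without gridification: each work item loops over a chunk.  */
  for (int i = start; i < end; i++)
    a[i] = b[i] + c[i];

  /* With gridification: the grid is launched with one work item per
     iteration, so the loop control flow disappears.  */
  int i = workitem_id ();
  if (i < n)
    a[i] = b[i] + c[i];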
The openacc kernels support maps loops in an oacc kernels region onto a
non-shared memory accelerator. These loops can be unannotated loops or
acc-loop annotated loops. If an acc-loop directive contains the
independent clause, the loop is explicitly independent.
The current oacc kernels implementation mostly ignores the acc-loop
directive, in order to unify handling of annotated and unannotated
loops. The patch "Handle acc loop directive" (at
https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01089.html ) expands the
annotated loop as a sequential loop.
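For reference, the two flavours look like this at the source level
(hand-written example):

  #pragma acc kernels
  {
    /* Unannotated loop: parallelizable only if the compiler can
       prove it independent.  */
    for (int i = 0; i < n; i++)
      a[i] = b[i] + c[i];

    /* Annotated loop: with an 'independent' clause it would be
       explicitly independent.  */
    #pragma acc loop
    for (int i = 0; i < n; i++)
      b[i] = a[i] * 2.0f;
  }

With the patch above, both forms are expanded as sequential loops, so
the rest of the pass group can treat them uniformly.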
At the point that we get to pass_parallelize_loops_oacc_kernels, we have
sequential loops in an offloaded function (atm, there's no support for
the independent clause yet).
So pass_parallelize_loops_oacc_kernels transforms sequential loops in an
offloaded function originating from a kernels region into explicitly
independent loops (see the sketch after this list) by:
- proving the loop is independent
- identifying reductions
- rewriting the loop into an acc-loop annotated loop
- annotating the offloaded function with kernel launch dimensions
- rewriting the preloop-load and postloop-store pair of a reduction
variable into an atomic update
- letting a subsequent ompexpand expand the acc-loop
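In source-level terms (again only a sketch, since the pass works on
the GIMPLE of the offloaded function; 'data' is the illustrative
base-pointer struct from before), the reduction loop above roughly
becomes

  double local_sum = 0.0;        /* conceptual per-gang partial sum   */
  #pragma acc loop gang          /* proved independent, now annotated */
  for (int i = 0; i < n; i++)
    local_sum += data->a[i];
  #pragma acc atomic update      /* reduction epilogue rewritten as   */
  *data->sum += local_sum;       /* an atomic update                  */

with the containing offloaded function annotated with the kernel
launch dimensions (here the gang dimension).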
I'd say there is no explicit gridification in there.
AFAIU, gridification is something that can result from determining
the launch dimensions of the offloaded function and optimizing for
those dimensions. Currently pass_parallelize_loops_oacc_kernels is a
place where we set launch dimensions, but we're not optimizing for
them there; that happens later on. (And I'm starting to wonder whether
I can get rid of the setting of the gang dimension in
pass_parallelize_loops_oacc_kernels.)
Thanks,
- Tom