[OpenACC 0/11] execution model

Nathan Sidwell Wed, 21 Oct 2015 11:46:54 -0700

I'll be posting a patch series for trunk, which implements the core of theOpenACC execution model. This is split into the following patches:


01-trunk-unique.patch
  Internal function with a 'uniqueness' property
02-trunk-nvptx-partition.patch
  NVPTX backend patch set for partitioned execution
03-trunk-hook.patch
  OpenACC hook
04-trunk-c.patch
  C FE changes
05-trunk-cxx.patch
  C++ FE changes
06-trunk-red-init.patch
  Placeholder to keep reductions functioning
07-trunk-loop-mark.patch
  Annotate OpenACC loops in device-agnostic manner
08-trunk-dev-lower.patch
  Device-specific lowering of loop markers
09-trunk-lower-gate.patch
  Run oacc_device_lower pass regardless of errors
10-trunk-libgomp.patch
  Libgomp change (remove dimension check)
11-trunk-tests.patch
  Initial set of execution tests
[let's try that again, after slapping my mail agent for using an old address]

With the exception of patch 6, these are all on the gomp4 branch. This patchset does not change reduction handling, which will be dealt with in a subsequentset.

An offloaded region is spawned on a set of execution engines. These areorganized as a cube, with specific axes controlled by the programmer. Theengines may operate in a 'partitioned' mode, where each engine executes as aseparate thread, or they may operate in a 'single' mode, where one engine of aparticular set executes the program and the other engines are idled (in animplementation-specific manner).


A driving example is the following:
#pragma acc parallel ...
{
 // single mode here
#pragma acc loop ...
for (i = 0; i < N; i++) // loop 1
  ... // partitioned mode here

if (expr) // single mode here
#pragma acc loop ...
  for (i = 0; i < N; i++) // loop 2
    ... // partitioned mode here
}

While it's clear all paths lead to loop 1, it's not statically determinablewhether loop 2 is executed or not.

This implementation marks the head and tail of partitioned execution regionswith builtin functions indicating the axes of partitioning. Afterdevice-specific lowering, these will eventually make it to RTL expansion time,where they get expanded to backend-specific RTL. In the PTX implementation'single' mode is implemented by a 'neutering' mechanism, where the non-activeexecution engines skip each basic block and 'follow along' conditional branchesto get to a subsequent block. In this manner all engines can reach adynamically determinable partitioned region.

On entry to a partitioned region, we execute a 'fork' operation, cloning livestate from the single active engine before the region, into the other threadsthat become activated.


This patchset has been tested on x86_64-linux & ptx accelerator.

nathan

[OpenACC 0/11] execution model

Reply via email to