I'll be posting a patch series for trunk, which implements the core of the OpenACC execution model. This is split into the following patches:

01-trunk-unique.patch
  Internal function with a 'uniqueness' property
02-trunk-nvptx-partition.patch
  NVPTX backend patch set for partitioned execution
03-trunk-hook.patch
  OpenACC hook
04-trunk-c.patch
  C FE changes
05-trunk-cxx.patch
  C++ FE changes
06-trunk-red-init.patch
  Placeholder to keep reductions functioning
07-trunk-loop-mark.patch
  Annotate OpenACC loops in device-agnostic manner
08-trunk-dev-lower.patch
  Device-specific lowering of loop markers
09-trunk-lower-gate.patch
  Run oacc_device_lower pass regardless of errors
10-trunk-libgomp.patch
  Libgomp change (remove dimension check)
11-trunk-tests.patch
  Initial set of execution tests
[let's try that again, after slapping my mail agent for using an old address]

With the exception of patch 6, these are all on the gomp4 branch. This patch set does not change reduction handling, which will be dealt with in a subsequent set.

An offloaded region is spawned on a set of execution engines. These are organized as a cube, with specific axes controlled by the programmer. The engines may operate in a 'partitioned' mode, where each engine executes as a separate thread, or they may operate in a 'single' mode, where one engine of a particular set executes the program and the other engines are idled (in an implementation-specific manner).

A driving example is the following:
#pragma acc parallel ...
{
 // single mode here
#pragma acc loop ...
for (i = 0; i < N; i++) // loop 1
  ... // partitioned mode here

if (expr) // single mode here
#pragma acc loop ...
  for (i = 0; i < N; i++) // loop 2
    ... // partitioned mode here
}

While it's clear all paths lead to loop 1, it's not statically determinable whether loop 2 is executed or not.

This implementation marks the head and tail of partitioned execution regions with builtin functions indicating the axes of partitioning. After device-specific lowering, these will eventually make it to RTL expansion time, where they get expanded to backend-specific RTL. In the PTX implementation 'single' mode is implemented by a 'neutering' mechanism, where the non-active execution engines skip each basic block and 'follow along' conditional branches to get to a subsequent block. In this manner all engines can reach a dynamically determinable partitioned region.

On entry to a partitioned region, we execute a 'fork' operation, cloning live state from the single active engine before the region, into the other threads that become activated.

This patchset has been tested on x86_64-linux & ptx accelerator.

nathan

Reply via email to