I'll be posting a patch series for trunk, which implements the core of the
OpenACC execution model. This is split into the following patches:
01-trunk-unique.patch
Internal function with a 'uniqueness' property
02-trunk-nvptx-partition.patch
NVPTX backend patch set for partitioned execution
03-trunk-hook.patch
OpenACC hook
04-trunk-c.patch
C FE changes
05-trunk-cxx.patch
C++ FE changes
06-trunk-red-init.patch
Placeholder to keep reductions functioning
07-trunk-loop-mark.patch
Annotate OpenACC loops in device-agnostic manner
08-trunk-dev-lower.patch
Device-specific lowering of loop markers
09-trunk-lower-gate.patch
Run oacc_device_lower pass regardless of errors
10-trunk-libgomp.patch
Libgomp change (remove dimension check)
11-trunk-tests.patch
Initial set of execution tests
[let's try that again, after slapping my mail agent for using an old address]
With the exception of patch 6, these are all on the gomp4 branch. This patch
set does not change reduction handling, which will be dealt with in a subsequent
set.
An offloaded region is spawned on a set of execution engines. These are
organized as a cube, with specific axes controlled by the programmer. The
engines may operate in a 'partitioned' mode, where each engine executes as a
separate thread, or they may operate in a 'single' mode, where one engine of a
particular set executes the program and the other engines are idled (in an
implementation-specific manner).
A driving example is the following:
#pragma acc parallel ...
{
// single mode here
#pragma acc loop ...
for (i = 0; i < N; i++) // loop 1
... // partitioned mode here
if (expr) // single mode here
#pragma acc loop ...
for (i = 0; i < N; i++) // loop 2
... // partitioned mode here
}
While it's clear all paths lead to loop 1, it's not statically determinable
whether loop 2 is executed or not.
This implementation marks the head and tail of partitioned execution regions
with builtin functions indicating the axes of partitioning. After
device-specific lowering, these will eventually make it to RTL expansion time,
where they get expanded to backend-specific RTL. In the PTX implementation
'single' mode is implemented by a 'neutering' mechanism, where the non-active
execution engines skip each basic block and 'follow along' conditional branches
to get to a subsequent block. In this manner all engines can reach a
dynamically determinable partitioned region.
On entry to a partitioned region, we execute a 'fork' operation, cloning live
state from the single active engine before the region, into the other threads
that become activated.
This patchset has been tested on x86_64-linux & ptx accelerator.
nathan