Hello all,
Recently, a branch for OpenACC was created
(http://gcc.gnu.org/ml/gcc/2013-09/msg00235.html).
As promised, I provide the design notes that describe our understanding
of the transformation of OpenACC constructs into OpenCL ones.
This is only a design draft and may change in the future.
OpenACC to OpenCL transformation
Contents
1 Overview
2 Parallel
2.1 If
2.2 Async
2.3 Num_gangs and num_workers
2.4 Vector_length
2.5 Private and firstprivate
2.6 Reduction
3 Kernels
3.1 If and async
3.2 Num_gangs, num_workers and vector_length
3.3 Several statements
4 Data
4.1 If
4.2 Copy and create
4.3 Present
5 Host_data
6 Loop
6.1 Collapse
6.2 Gang, vector and worker
6.3 Seq and independent
6.4 Private and reduction
7 Cache
8 Update
1 Overview
This page describes the OpenCL function call sequences related to each
OpenACC directive or clause. The content is only a design draft.
2 Parallel
The parallel construct tells the compiler that the whole region should
be run on the target in parallel. It does not mean that every loop in
the region must be parallelizable.
Assume the code to compile is like the following.
INTEGER :: i
REAL :: a(SIZE), b(SIZE), c(SIZE)
!$ACC PARALLEL
DO i = 1, size
c(i) = a(i) / b(i)
ENDDO
!$ACC END PARALLEL
Here SIZE is smaller than the workgroup (gang) size. Since the
variables are not declared in data clauses, the compiler must add a
present_or_copy(a, b, c) clause (the PGI compiler, however, generates
present_or_copyin(a, b) and present_or_copyout(c) clauses, since c is
never read and a, b are never stored to). It must also insert an
implicit barrier at the end of the parallel region, since no async
clause is present.
The generated OpenCL call sequence is like the following:
// ...
// Initialization
// ...
buffer_a = clCreateBuffer(context, CL_MEM_READ_WRITE, SIZE *
SIZEOF_REAL, NULL, &err);
buffer_b = clCreateBuffer(context, CL_MEM_READ_WRITE, SIZE *
SIZEOF_REAL, NULL, &err);
buffer_c = clCreateBuffer(context, CL_MEM_READ_WRITE, SIZE *
SIZEOF_REAL, NULL, &err);
// Checking copyin presence
// ...
// Write data to the target
clEnqueueWriteBuffer(queue, buffer_a, 0, 0, SIZE * SIZEOF_REAL, a,
NULL, NULL);
clEnqueueWriteBuffer(queue, buffer_b, 0, 0, SIZE * SIZEOF_REAL, b,
NULL, NULL);
clEnqueueWriteBuffer(queue, buffer_c, 0, 0, SIZE * SIZEOF_REAL, c,
NULL, NULL);
// Building kernel
// ...
// Enqueue the kernel
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer_a);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &buffer_b);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &buffer_c);
global_size = SIZE;
local_size = SIZE;
clEnqueueNDRangeKernel(queue, kernel, 1, &global_size, &local_size, 0,
NULL, &event);
// Implicit barrier
clWaitForEvents(1, &event);
// Checking copyout presence
// ...
// Read data from the target
clEnqueueReadBuffer(queue, buffer_a, 0, 0, SIZE * SIZEOF_REAL, a,
NULL, NULL);
clEnqueueReadBuffer(queue, buffer_b, 0, 0, SIZE * SIZEOF_REAL, b,
NULL, NULL);
clEnqueueReadBuffer(queue, buffer_c, 0, 0, SIZE * SIZEOF_REAL, c,
NULL, NULL);
Note: the compiler must use events instead of barriers to wait for
completion of the execution, since OpenCL barriers only synchronize
data on the target.
The compiler also generates an OpenCL kernel. Since the kernel is
generated directly from GIMPLE, it is quite low-level. The logic of the
generated kernel is like the following
__kernel void k(__global double* a,
__global double* b,
__global double* c)
{
int i = get_global_id(0);
if (i < SIZE)
c[i] = a[i]/b[i];
}
Note that the compiler may parallelize the loop; in other words, it
inserts a loop directive right before the loop. When the compiler
encounters several parallelizable loops in a single parallel region, it
may insert loop before each of them.
When the region does not contain any OpenACC directive, the compiler
may generate a single kernel for the whole region.
2.1 If
When the compiler encounters an if clause (in other words, a statement
like !$ACC PARALLEL IF (condition)), it must generate two versions of
the region: an OpenCL one and a native one. The OpenCL version is like
the plain parallel one, while the other is simply compiled to native
host code.
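A minimal sketch of the host-side dispatch, assuming hypothetical
helper functions run_region_opencl and run_region_native that wrap the
two generated versions of the region:
// Dispatch for !$ACC PARALLEL IF (condition)
if (condition)
run_region_opencl(); // enqueue the OpenCL kernel as in section 2
else
run_region_native(); // run the natively compiled copy of the region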
2.2 Async
When async is present in the directive, the compiler must not add a
barrier at the end of the region. Therefore, the host code is like the
parallel one without the clWaitForEvents call:
clEnqueueNDRangeKernel(queue, kernel, 1, &global_size, &local_size, 0,
NULL, NULL);
When the async clause has a parameter, it serves as an explicit
identifier of an event to be waited for by the wait directive or the
acc_async_wait_all runtime routine.
For example, if we have code like:
!$ACC PARALLEL ASYNC(0)
...
!$ACC PARALLEL ASYNC(1)
...
!$ACC WAIT(1)
!$ACC WAIT(0)
the generated code will be like:
clEnqueueNDRangeKernel(queue, kernel0, 1, &global_size, &local_size,
0, NULL, &event[0]);
...
clEnqueueNDRangeKernel(queue, kernel1, 1, &global_size, &local_size,
0, NULL, &event[1]);
...
clWaitForEvents(1, &event[1]);
clWaitForEvents(1, &event[0]);
where event is an array of cl_event objects.
2.3 Num_gangs and num_workers
The num_gangs clause determines the number of work groups on the
target, while num_workers defines the number of OpenCL work items in
each work group. So, when num_workers is present, the local_work_size
parameter of the clEnqueueNDRangeKernel call is exactly the num_workers
parameter.
global_size = SIZE;
local_size = num_workers;
clEnqueueNDRangeKernel(queue, kernel, 1, &global_size, &local_size, 0,
NULL, &event);
Otherwise, the number of work items per work group is equal to
(SIZE-1)/num_gangs+1, and the statement to enqueue the kernel is like
the following.
global_size = SIZE;
local_size = (SIZE-1)/num_gangs+1;
clEnqueueNDRangeKernel(queue, kernel, 1, &global_size, &local_size, 0,
NULL, &event);
Note that when num_workers is not defined, it defaults to
SIZE/num_gangs, and vice versa.
2.4 Vector_length
The vector_length clause defines the length of the vector used in
vector operations. For example, if the parameter of the clause is 4,
the kernel uses the double4 type instead of double. Therefore, the
generated kernel is like the following (the matching host-side change
is sketched after the kernel):
__kernel void k(__global double4* a,
__global double4* b,
__global double4* c)
{
int i = get_global_id(0);
if (i < SIZE/4)
c[i] = a[i]/b[i];
}
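On the host side, the index space shrinks accordingly. A sketch,
written out with the full clEnqueueNDRangeKernel signature:
// Enqueue for vector_length(4): each work item handles one double4
global_size = SIZE / 4;
local_size = SIZE / 4;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size,
&local_size, 0, NULL, &event);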
2.5 Private and firstprivate
The private clause defines data that is private to each work group.
This means the compiler must declare this data as __local. So, if we
have code like:
REAL :: d(SIZE)
!$ACC PARALLEL PRIVATE(d)
the generated kernel is like
__kernel void k(__global double* a,
__global double* b,
__global double* c)
{
__local double d[SIZE];
int i = get_global_id(0);
if (i < SIZE)
c[i] = a[i]/b[i];
}
Note that no data is written to or read from the target; therefore,
the host code is not changed.
Unlike private, firstprivate specifies that the private data must be
initialized. Therefore, the data must be written to the target.
buffer_d = clCreateBuffer(context, CL_MEM_READ_WRITE, SIZE *
SIZEOF_REAL, NULL, &err);
...
clEnqueueWriteBuffer(queue, buffer_d, 0, 0, SIZE * SIZEOF_REAL, d,
NULL, NULL);
...
clSetKernelArg(kernel, 3, sizeof(cl_mem), &buffer_d);
The generated kernel also changes; it copies the data from global
memory to local memory.
__kernel void k(__global double* a,
__global double* b,
__global double* c,
__global double* d)
{
__local double local_d[SIZE];
int i = get_global_id(0);
if (i == 0)
{
for (int j = 0; j < SIZE; j++)
local_d[j] = d[j];
}
barrier(CLK_LOCAL_MEM_FENCE);
if (i < SIZE)
c[i] = a[i]/b[i];
}
2.6 Reduction
To show the generated host and target code for the reduction clause, we
must change the OpenACC code:
INTEGER :: i
REAL :: a(SIZE), b(SIZE), c(SIZE), sum
!$ACC PARALLEL REDUCTION(+:sum)
DO i = 1, size
c(i) = a(i) / b(i)
sum = sum + c(i)
ENDDO
!$ACC END PARALLEL
In addition to the kernel that performs the division, the compiler
will generate a reduction kernel (note: this is only a design proof of
concept):
__kernel void reduce(__global float *c)
{
int i = get_global_id(0);
int size = get_global_size(0);
if (i < size/2)
c[i] += c[size/2 + i];
}
This kernel splits the c array into two halves and reduces it into the
first half. Since we need the sum of the whole array, the host program
must execute the kernel log2(SIZE) times. Therefore, the host program
is like the following.
for (i = SIZE; i > 0; i = i / 2)
{
clSetKernelArg(reduction_kernel, 0, sizeof(cl_mem), &buffer_c);
global_size = i;
local_size = i;
clEnqueueNDRangeKernel(queue, reduction_kernel, 1, NULL,
&global_size, &local_size, 0, NULL, &event);
clEnqueueBarrier(queue);
}
clEnqueueReadBuffer(queue, buffer_c, 0, 0, SIZEOF_REAL, &sum, 1,
&event, NULL);
Note the barrier used to synchronize the array data, and that we read
only the first element of the array.
3 Kernels
In contrast to parallel, the kernels construct may carry no clauses at
all. In this case, the compiler must implicitly add them to parallelize
the execution of the region. This section provides the OpenCL code to
choose the corresponding values, to enqueue the kernels generated from
the region, and to execute them in order.
3.1 If and async
These clauses are similar to the parallel clauses, and the generated
code is the same.
3.2 Num_gangs, num_workers and vector_length
Assume that the OpenACC program is similar to the parallel one.
INTEGER :: i
REAL :: a(SIZE), b(SIZE), c(SIZE)
!$ACC KERNELS
DO i = 1, size
c(i) = a(i) / b(i)
ENDDO
!$ACC END KERNELS
Since there are no num_gangs, num_workers, or vector_length clauses on
the kernels construct, the compiler must choose appropriate values for
them (how these values feed into the kernel launch is sketched after
the snippets below). To obtain them, the generated code can call:
vector_length
cl_uint double_width;
clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE,
sizeof(double_width), &double_width, NULL);
num_workers
size_t num_workers;
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
sizeof(num_workers), &num_workers, NULL);
num_gangs
int num_gangs = SIZE/num_workers;
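These values then determine the NDRange of the generated kernels. A
sketch of the assumed mapping, following section 2.3 and written out
with the full clEnqueueNDRangeKernel signature:
// global_work_size = num_gangs * num_workers, local_work_size = num_workers
global_size = num_gangs * num_workers;
local_size = num_workers;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size,
&local_size, 0, NULL, &event);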
3.3 Several statements
If the region contains several statements, the compiler may split them
into several kernels. For example, if the region has two loop nests:
INTEGER :: i
REAL :: a(SIZE), b(SIZE), c(SIZE), d(SIZE)
!$ACC KERNELS
DO i = 1, size
c(i) = a(i) / b(i)
ENDDO
DO i = 1, size
d(i) = a(size + 1 - i) / b(i)
ENDDO
!$ACC END KERNELS
the compiler generates two kernels, one for each nest (a sketch of the
host code that enqueues them in order follows the kernels).
__kernel void k1(__global double* a,
__global double* b,
__global double* c)
{
int i = get_global_id(0);
if (i < SIZE)
c[i] = a[i]/b[i];
}
__kernel void k2(__global double* a,
__global double* b,
__global double* d)
{
int i = get_global_id(0);
if (i < SIZE)
d[i] = a[SIZE - 1 - i]/b[i];
}
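A sketch of the host code for the two kernels; with an in-order
command queue they execute one after another (kernel_k1, kernel_k2 and
the event names are assumptions of this sketch):
clEnqueueNDRangeKernel(queue, kernel_k1, 1, NULL, &global_size,
&local_size, 0, NULL, &event_k1);
clEnqueueNDRangeKernel(queue, kernel_k2, 1, NULL, &global_size,
&local_size, 0, NULL, &event_k2);
// Implicit barrier at the end of the kernels region
clWaitForEvents(1, &event_k2);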
4 Data
The data construct defines the data to write to the target at the
beginning of the region and to read from it at the end of the region.
Note that the compiler must not generate any OpenCL kernel for this
construct.
4.1 If
The if clause is similar to the parallel if clause.
4.2 Copy and create
The copy data clauses declare the data to copy between the host and the
target: copyin defines the data to copy from the host to the target at
the beginning of the region, copyout the data to copy from the target
back to the host at the end, and copy is the combination of copyin and
copyout. Finally, create defines the data to allocate on the target
without copying anything.
So, the code for these clauses is like the following (the
allocated_data_add function is described below).
copyin(a)
buffer_a = clCreateBuffer(context, CL_MEM_READ_WRITE, SIZE *
SIZEOF_REAL, NULL, &err);
...
clEnqueueWriteBuffer(queue, buffer_a, 0, 0, SIZE * SIZEOF_REAL, a,
NULL, NULL);
allocated_data_add("a");
...
// kernel build and run code
copyout(a)
buffer_a = clCreateBuffer(context, CL_MEM_READ_WRITE, SIZE *
SIZEOF_REAL, NULL, &err);
...
// kernel build and run code
clEnqueueReadBuffer(queue, buffer_a, 0, 0, SIZE * SIZEOF_REAL, a,
NULL, NULL);
allocated_data_add("a");
...
copy(a)
buffer_a = clCreateBuffer(context, CL_MEM_READ_WRITE, SIZE *
SIZEOF_REAL, NULL, &err);
...
clEnqueueWriteBuffer(queue, buffer_a, 0, 0, SIZE * SIZEOF_REAL, a,
NULL, NULL);
allocated_data_add("a");
...
// kernel build and run code
clEnqueueReadBuffer(queue, buffer_a, 0, 0, SIZE * SIZEOF_REAL, a,
NULL, NULL);
...
create(a)
void *target_a = calloc(SIZE, SIZEOF_REAL);
buffer_a = clCreateBuffer(context, CL_MEM_READ_WRITE, SIZE *
SIZEOF_REAL, NULL, &err);
...
clEnqueueWriteBuffer(queue, buffer_a, 0, 0, SIZE * SIZEOF_REAL,
target_a, NULL, NULL);
allocated_data_add("a");
free (target_a);
...
// kernel build and run code
4.3 Present
The present clause defines data that is already present on the target.
The present_or_* clauses define data that should be checked for
presence; when it is not present, the corresponding operation is
performed. For example, if the compiler encounters a
present_or_copyin(a) clause, it must check whether a is already present
and, if not, copy it to the target.
Therefore, the generated program must keep track of the data already on
the target in a container; the container may live in the runtime
library (a sketch of such a container is given after the snippets
below). The generated OpenCL code will be like the following.
present(a)
buffer_a = clCreateBuffer(context, CL_MEM_READ_WRITE, SIZE *
SIZEOF_REAL, NULL, &err);
...
if (!allocated_data_contains("a"))
{
error();
}
...
// kernel build and run code
present_or_copy(a)
buffer_a = clCreateBuffer(context, CL_MEM_READ_WRITE, SIZE *
SIZEOF_REAL, NULL, &err);
...
if (!allocated_data_contains("a"))
{
clEnqueueWriteBuffer(queue, buffer_a, 0, 0, SIZE * SIZEOF_REAL, a,
NULL, NULL);
allocated_data_add("a");
}
...
// kernel build and run code
...
clEnqueueReadBuffer(queue, buffer_a, 0, 0, SIZE * SIZEOF_REAL, a,
NULL, NULL);
present_or_copyin(a)
buffer_a = clCreateBuffer(context, CL_MEM_READ_WRITE, SIZE *
SIZEOF_REAL, NULL, &err);
...
if (!allocated_data_contains("a"))
{
clEnqueueWriteBuffer(queue, buffer_a, 0, 0, SIZE * SIZEOF_REAL, a,
NULL, NULL);
allocated_data_add("a");
}
...
// kernel build and run code
present_or_copyout(a)
void *target_a = calloc(SIZE, SIZEOF_REAL);
buffer_a = clCreateBuffer(context, CL_MEM_READ_WRITE, SIZE *
SIZEOF_REAL, NULL, &err);
...
if (!allocated_data_contains("a"))
{
clEnqueueWriteBuffer(queue, buffer_a, 0, 0, SIZE * SIZEOF_REAL,
target_a, NULL, NULL);
allocated_data_add("a");
}
free(target_a);
...
// kernel build and run code
...
clEnqueueReadBuffer(queue, buffer_a, 0, 0, SIZE * SIZEOF_REAL, a,
NULL, NULL);
present_or_create(a)
void *target_a = calloc(SIZE, SIZEOF_REAL);
buffer_a = clCreateBuffer(context, CL_MEM_READ_WRITE, SIZE *
SIZEOF_REAL, NULL, &err);
...
if (!allocated_data_contains("a"))
{
clEnqueueWriteBuffer(queue, buffer_a, 0, 0, SIZE * SIZEOF_REAL,
target_a, NULL, NULL);
allocated_data_add("a");
}
free(target_a);
...
// kernel build and run code
...
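The allocated_data_* helpers used above could live in the runtime
library. A minimal sketch, assuming a simple list keyed by the variable
name (the real container design is not fixed by this draft; needs
<string.h> for strcmp, and MAX_ALLOCATED is an assumed constant):
#define MAX_ALLOCATED 1024
static const char *allocated_data[MAX_ALLOCATED];
static int allocated_count = 0;
// Record that the named variable is now allocated on the target
void allocated_data_add(const char *name)
{
allocated_data[allocated_count++] = name;
}
// Return non-zero if the named variable is already on the target
int allocated_data_contains(const char *name)
{
for (int i = 0; i < allocated_count; i++)
if (strcmp(allocated_data[i], name) == 0)
return 1;
return 0;
}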
5 Host_data
When we have a library that is aware of the GPU, we can pass pointers
to device data in calls to that library (a sketch of the corresponding
generated host code follows the example):
!$acc data copy(A)
!$acc parallel loop
do i=1,N
enddo
!$acc end parallel loop
! call function that requires device pointer
!$acc host_data use_device(A)
call do_something_on_device(A)
!$acc end host_data
!$acc parallel loop
do i=1,N
enddo
!$acc end parallel loop
!$acc end data
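A sketch of the host code generated for the host_data region, assuming
the runtime maps A to the cl_mem object buffer_A and the device-aware
routine accepts it (an assumption of this sketch, not fixed by the
draft above):
// Inside !$acc host_data use_device(A): pass the device buffer
// instead of the host array
do_something_on_device(buffer_A);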
6 Loop
The loop directive describes a loop to be parallelized. Note that even
if a loop inside a parallel or kernels region is not preceded by the
directive, the compiler may add the directive implicitly.
6.1 Collapse
The collapse clause defines how many nested loops must be parallelized.
For example, if we have code like
!$ACC LOOP
DO i = 1,10
DO j = 1,10
a(i,j) = b(i,j)/c(i,j)
ENDDO
ENDDO
the compiler implicitly adds collapse(1) and the generated kernel is
like
__kernel void k(__global double* a,
__global double* b,
__global double* c)
{
int i = get_global_id(0);
// a(i,j) etc. are linearized using Fortran column-major layout
if (i < 10)
for (int j = 0; j < 10; j++)
a[j*10 + i] = b[j*10 + i]/c[j*10 + i];
}
However, when the programmer explicitly adds collapse(2), the compiler
generates the following kernel
__kernel void k(__global double* a,
__global double* b,
__global double* c)
{
int i = get_global_id(0);
if (i < 100)
a[i] = b[i]/c[i];
}
6.2 Gang, vector and worker
These clauses specify the type of parallelism to use: across gangs,
workers, or vector lanes.
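A sketch of the assumed mapping inside a generated kernel, following
the gang/worker mapping used in section 2.3:
// gang   -> OpenCL work group        (get_group_id(0))
// worker -> work item in the group   (get_local_id(0))
// vector -> vector data types, e.g. double4 (see section 2.4)
int gang = get_group_id(0);
int worker = get_local_id(0);
int i = gang * get_local_size(0) + worker; // equals get_global_id(0)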
6.3 Seq and independent
The seq clause means "execute sequentially", while independent tells
the compiler that the loop iterations are data independent. The
compiler uses seq when it cannot parallelize the loop.
6.4 Private and reduction
These clauses resemble the corresponding parallel clauses.
7 Cache
The cache directive defines the variables that must be cached by the
kernel. For example, if we have code like
!$ACC LOOP
DO i = 1, 10
!$ACC CACHE(d)
a(i) = d * b(i)
ENDDO
the compiler generates a kernel like
__kernel void k(__global double *a,
__global double *b,
__global double *d)
{
double cache_d;
cache_d = *d;
for (int i = 0; i < 10; i++)
a[i] = cache_d*b[i];
}
8 Update
The update directive behaves like the data clauses: host is like
copyout and device is like copyin. The if and async clauses resemble
the parallel ones.
Here is a use case for the update directive (a sketch of the generated
host calls follows the example):
! the data directive causes array A resides
! on device through all region
!$acc data copy(A)
!$acc parallel loop
do i=1,N
A(i) = ...
enddo
!$acc end parallel loop
! however in the middle of region
! we need to do something on host
! with this array,
! so we update host copy
!$acc update host(A)
call do_something_on_host(A)
! after that we need to refresh
! device copy of the array
!$acc update device(A)
!$acc parallel loop
do i=1,N
enddo
!$acc end parallel loop
!$acc end data
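A sketch of the host calls generated for the two update directives,
written out with the full blocking-call signatures; buffer_A and
N * SIZEOF_REAL are assumed names following the conventions above:
// !$acc update host(A): refresh the host copy of A
clEnqueueReadBuffer(queue, buffer_A, CL_TRUE, 0, N * SIZEOF_REAL, A,
0, NULL, NULL);
// !$acc update device(A): refresh the device copy of A
clEnqueueWriteBuffer(queue, buffer_A, CL_TRUE, 0, N * SIZEOF_REAL, A,
0, NULL, NULL);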
I'd really appreciate it if you could provide some comments on it.
--
Regards, Ilmir Usmanov