Hello all,
Recently, it was created branch for OpenACC (http://gcc.gnu.org/ml/gcc/2013-09/msg00235.html)

As had been promised, I provide design notes, that describe our understanding of transformations of OpenACC constructs to OpenCL ones. That's only the design draft and may be changed in future.


OpenACC to OpenCL transformation

Contents
1 Overview
2 Parallel
2.1 If
2.2 Async
2.3 Num_gangs and num_workers
2.4 Vector_length
2.5 Private and firstprivate
2.6 Reduction
3 Kernels
3.1 If and async
3.2 Num_gangs, num_workers and vector_length
3.3 Several statements
4 Data
4.1 If
4.2 Copy and create
4.3 Present
5 Host_data
6 Loop
6.1 Collapse
6.2 Gang, vector and worker
6.3 Seq and independent
6.4 Private and reduction
7 Cache
8 Update


1 Overview

This page describes OpenCL function call sequences related to each OpenACC directive or clause. The content is only design draft.


2 Parallel
Parallel construct says to the compiler that the whole region should be run on the target in parallel. It does not mean that any loop from the region must be parallelizable.

Let the code to compile is like the following.

 INTEGER :: i
 REAL :: a(SIZE), b(SIZE), c(SIZE)

 !$ACC PARALLEL
 DO i = 1, size
   c(i) = a(i) / b(i)
 ENDDO
 !$ACC END PARALLEL

Where SIZE in smaller than workgroup (gang) size. Since variables are not declared in data clauses, the compiler must add present_or_copy(a, b, c) clause (however, PGI compiler generates present_or_copyin(a, b) and present_or_copyout(c) clauses, since there is no reading from c and storing to a, b). Also it must insert an implicit barrier to the end of the parallel region since async clause does not present.
Generated OpenCL call sequence is like the following:

 // ...
 // Initialization
 // ...

buffer_a = clCreateBuffer(context, CL_MEM_READ_WRITE, SIZE * SIZEOF_REAL, NULL, &err); buffer_b = clCreateBuffer(context, CL_MEM_READ_WRITE, SIZE * SIZEOF_REAL, NULL, &err); buffer_c = clCreateBuffer(context, CL_MEM_READ_WRITE, SIZE * SIZEOF_REAL, NULL, &err);

 // Checking copyin presence
 // ...

 // Write data to the target
clEnqueueWriteBuffer(queue, buffer_a, 0, 0, SIZE * SIZEOF_REAL, a, NULL, NULL); clEnqueueWriteBuffer(queue, buffer_b, 0, 0, SIZE * SIZEOF_REAL, b, NULL, NULL); clEnqueueWriteBuffer(queue, buffer_c, 0, 0, SIZE * SIZEOF_REAL, c, NULL, NULL);

 // Building kernel
 // ...

 // Enqueuue the kernel
 clSetKernelArg(kernel, 0, SIZE * SIZEOF_REAL, buffer_a);
 clSetKernelArg(kernel, 1, SIZE * SIZEOF_REAL, buffer_b);
 clSetKernelArg(kernel, 2, SIZE * SIZEOF_REAL, buffer_c);
 global_size = SIZE;
 local_size = SIZE;
clEnqueueNDRangeKernel(queue, kernel, 1, &global_size, &local_size, 0, NULL, &event);

 // Implicit barrier
 clWaitForEvent(1, &event);

 // Checking copyout presence
 // ...

 // Read data from the target
clEnqueueReadBuffer(queue, buffer_a, 0, 0, SIZE * SIZEOF_REAL, a, NULL, NULL); clEnqueueReadBuffer(queue, buffer_b, 0, 0, SIZE * SIZEOF_REAL, a, NULL, NULL); clEnqueueReadBuffer(queue, buffer_c, 0, 0, SIZE * SIZEOF_REAL, a, NULL, NULL);

Note: the compiler must use events instead of barriers to wait for the execution completion, since OpenCL barriers are used to synchronize only target data.

Also the compiler generates OpenCL kernel. Since the kernel is generated directly from GIMPLE, it is quite low-level. The logic of generated kernel is like the following

 __kernel k(__global double* a,
            __global double* b,
            __global double* c)
 {
   int i = get_global_id(0);
   if (i < SIZE)
     c[i] = a[i]/b[i];
 }

Note, that the compiler may parallelize the loop, in other words, it inserts a loop directive right before the loop. When the compiler encounters several parallelizable loops in a single parallel region, it may insert loop before each of them.

When the region does not contain any OpenACC directive, the compiler may generate a single kernel for whole region.


2.1 If
When the compiler encounters if clause (in other words, the statement is like !$ACC PARALLEL IF (condition)), it must generate two types of region - OpenCL and native. The OpenCL part is like to parallel one. While the other is just compiled to native code.


2.2 Async
When async presents in the directive, the compiler must not add barrier at the end of the region. Therefore, the host code is like the parallel one without clWaitForEvent(event) statement: clEnqueueNDRangeKernel(queue, kernel, 1, &global_size, &local_size, 0, NULL, NULL); When the parameter of async clause presents, it determines an explicit identifier of an event to wait for by wait directive or acc_async_wait_all runtime routine.
For example, if we have code like:

 !$ACC PARALLEL ASYNC(0)
 ...
 !$ACC PARALLEL ASYNC(1)
 ...
 !$ACC WAIT(1)
 !$ACC WAIT(0)

the generated code will be like:

clEnqueueNDRangeKernel(queue, kernel0, 1, &global_size, &local_size, 0, NULL, &event[0]);
 ...
clEnqueueNDRangeKernel(queue, kernel1, 1, &global_size, &local_size, 0, NULL, &event[1]);
 ...
 clWaitForEvent(&event[1]);
 clWaitForEvent(&event[0]);

where event is a massive of cl_events.


2.3 Num_gangs and num_workers
Num_gangs clause determines number of target work groups, while num_workers defines number of OpenCL work items in each work group. So, when num_workers presents, the local_work_size parameter of clEnqueueNDRangeKernel function is exactly num_workers parameter.

 global_size = SIZE;
 local_size = num_workers;
clEnqueueNDRangeKernel(queue, kernel, 1, &global_size, &local_size, 0, NULL, &event);

Otherwise, number of work items is equal to (SIZE-1)/num_gangs+1. And the statement to enqueue kernel is like the following.

 global_size = SIZE;
 local_size = (SIZE-1)/num_gangs+1;
clEnqueueNDRangeKernel(queue, kernel, 1, &global_size, &local_size, 0, NULL, &event);

Note, that when num_workers is not defined, it is equal to SIZE/num_gangs and vice versa.


2.4 Vector_length
Vector_length clause defines the length of vector in vector operations. For example, if the parameter of the clause is 4, the kernel uses double4 type instead of double. Therefore, the generated kernel is like:

 __kernel k(__global double4* a,
            __global double4* b,
            __global double4* c)
 {
   int i = get_global_id(0);
   if (i < SIZE/4)
     c[i] = a[i]/b[i];
 }


2.5 Private and firstprivate
Private clause defines the data, that is private for each work group. It means the compiler must declare this data as __local. So, if we have code like:

 REAL :: d(SIZE)
 !$ACC PARALLEL PRIVATE(d)

the generated kernel is like

 __kernel k(__global double* a,
            __global double* b,
            __global double* c)
 {
   __local double d[SIZE];
   int i = get_global_id(0);
   if (i < SIZE)
     c[i] = a[i]/b[i];
 }

Note, that no data is written to or read from the target. Therefore, the host code is not changed.

Unlike private, firstprivate specifies that the local data must be initialized. Therefore, one must to write the data to the target.

buffer_d = clCreateBuffer(context, CL_MEM_READ_WRITE, SIZE * SIZEOF_REAL, NULL, &err);
 ...
clEnqueueWriteBuffer(queue, buffer_d, 0, 0, SIZE * SIZEOF_REAL, c, NULL, NULL);
 ...
 clSetKernelArg(kernel, 3, SIZE * SIZEOF_REAL, buffer_d);

Also, the generated kernel is changed, it performs copying of the data from global to local memory.

 __kernel k(__global double* a,
            __global double* b,
            __global double* c,
            __global double* d)
 {
   __local double local_d[SIZE];
   int i = get_global_id(0);
   if (i == 0)
   {
     for (int i = 0; i < SIZE; i++)
       local_d[i] = d[i];
     barrier(CLK_LOCAL_MEM_FENCE);
   }
   if (i < SIZE)
     c[i] = a[i]/b[i];
 }


2.6 Reduction
To show the generated host and target code for reduction clause we must change the OpenACC code:

 INTEGER :: i
 REAL :: a(SIZE), b(SIZE), c(SIZE), sum

 !$ACC PARALLEL REDUCTION(+:sum)
 DO i = 1, size
   c(i) = a(i) / b(i)
   sum = sum + c(i)
 ENDDO
 !$ACC END PARALLEL

In addition to kernel that performs division the compiler will generate a reduction kernel (note: this is only design proof of concept):

 __kernel void reduce(__global float *c)
 {
   int i = get_global_id(0);
   int size = get_global_size(0);
   if (i < size/2)
     c[i] += c[size/2 + i];
 }

This kernel splits the c massive into two ones and reduces it into first half. Since we need the sum of whole massive, the host program must execute the kernel log(SIZE) times. Therefore, the host program is like the following.

 for (i = SIZE; i > 0; i = i / 2)
 {
   clSetKernelArg(reduction_kernel, 0, i, buffer_c);
   global_size = i;
   local_size = i;
clEnqueueNDRangeKernel(queue, reduction_kernel, 1, NULL, global_size, local_size, 0, NULL, &event);
   clEnqueueBarrier(queue);
 }
 clEnqueueReadBuffer(queue, buffer_c, 0, 0, 8, &sum, 1, &event, NULL);

Note the barrier to synchronize the massive's data and that we read only first variable from the massive.


3 Kernels
In spite of parallel, kernels construct does not have many clauses. In this case, the compiler must impicitly add them to parallelize the execution of the region. This section provides the OpenCL code to add them, to enqueue the kernels, which are generated from the region and to execute them in the order.


3.1 If and async
These clauses are similar to parallel clauses, and the generated code is the same.


3.2 Num_gangs, num_workers and vector_length
Assume that the OpenACC program is similar to parallel's one.

 INTEGER :: i
 REAL :: a(SIZE), b(SIZE), c(SIZE)

 !$ACC KERNELS
 DO i = 1, size
   c(i) = a(i) / b(i)
 ENDDO
 !$ACC END KERNELS

Since there are no num_gangs, num_workers and vector_length in kernels construct, the compiler must choose their appropriate values. To perform this the compiler can call:
vector_length

 cl_uint double_width;
clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE, sizeof(double_width), double_width, NULL);

num_workers

 size_t num_workers;
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(num_workers), num_workers, NULL);

num_gangs

 int num_gangs = SIZE/num_workers;


3.3 Several statements
If the region contains several statements the compiler may split them into several kernels. For example, if the region has two loop nests:

 INTEGER :: i
 REAL :: a(SIZE), b(SIZE), c(SIZE), d(SIZE)

 !$ACC KERNELS
 DO i = 1, size
   c(i) = a(i) / b(i)
 ENDDO
 DO i = 1, size
   d(i) = a(size + 1 - i) / b(i)
 ENDDO
 !$ACC END KERNELS

the compiler generates two kernels, for each nest.

 __kernel k1(__global double* a,
             __global double* b,
             __global double* c)
 {
   int i = get_global_id(0);
   if (i < SIZE)
     c[i] = a[i]/b[i];
 }
 __kernel k2(__global double* a,
             __global double* b,
             __global double* d)
 {
   int i = get_global_id(0);
   if (i < SIZE)
     d[i] = a[SIZE - i]/b[i];
 }


4 Data
Data construct defines the data to write to the target at the beginning of the region and to read from it at the end of the region. Note, that the compiler must not generate any OpenCL kernel for this construct.


4.1 If
If clause is similar to parallel if.


4.2 Copy and create
Copy data clauses declare the data to copy between host and target. copyin defines the data to copy from the host to the target at the beginning of the region, copyout - from the target at the end, copy is sum of copyin and copyout. Finally, create defines the data to allocate at the target, but not to copy.

So, the code for these clauses is like the following (allocated_data_add function described below).

copyin(a)

buffer_a = clCreateBuffer(context, CL_MEM_READ_WRITE, SIZE * SIZEOF_REAL, NULL, &err);
 ...
clEnqueueWriteBuffer(queue, buffer_a, 0, 0, SIZE * SIZEOF_REAL, a, NULL, NULL);
 allocated_data_add("a");
 ...
 // kernel build and run code

copyout(a)

buffer_a = clCreateBuffer(context, CL_MEM_READ_WRITE, SIZE * SIZEOF_REAL, NULL, &err);
 ...
 // kernel build and run code
clEnqueueReadBuffer(queue, buffer_a, 0, 0, SIZE * SIZEOF_REAL, a, NULL, NULL);
 allocated_data_add("a");
 ...

copy(a)

buffer_a = clCreateBuffer(context, CL_MEM_READ_WRITE, SIZE * SIZEOF_REAL, NULL, &err);
 ...
clEnqueueWriteBuffer(queue, buffer_a, 0, 0, SIZE * SIZEOF_REAL, a, NULL, NULL);
 allocated_data_add("a");
 ...
 // kernel build and run code
clEnqueueReadBuffer(queue, buffer_a, 0, 0, SIZE * SIZEOF_REAL, a, NULL, NULL);
 ...

create(a)

 void *target_a = calloc(SIZE * 8);
buffer_a = clCreateBuffer(context, CL_MEM_READ_WRITE, SIZE * SIZEOF_REAL, NULL, &err);
 ...
clEnqueueWriteBuffer(queue, buffer_a, 0, 0, SIZE * SIZEOF_REAL, target_a, NULL, NULL);
 allocated_data_add("a");
 free (target_a);
 ...
 // kernel build and run code


4.3 Present
Present clause defines the data that already presents on the target. present_or_* clauses define the data that should be checked for presence, and, when it is not present to perform the operation. For example, if the compiler encounters present_or_copyin(a) clause, it must check whether a already presents, and, if not, copy it to the target. So, the generated program must contain the data that already on the target in a container. A container may be in the runtime library.

Therefore, the generated OpenCL code will be like the following.

present(a)

buffer_a = clCreateBuffer(context, CL_MEM_READ_WRITE, SIZE * SIZEOF_REAL, NULL, &err);
 ...
 if (!allocated_data_contains("a"))
 {
   error();
 }
 ...
 // kernel build and run code

present_or_copy(a)

buffer_a = clCreateBuffer(context, CL_MEM_READ_WRITE, SIZE * SIZEOF_REAL, NULL, &err);
 ...
 if (!allocated_data_contains("a"))
 {
clEnqueueWriteBuffer(queue, buffer_a, 0, 0, SIZE * SIZEOF_REAL, a, NULL, NULL);
   allocated_data_add("a");
 }
 ...
 // kernel build and run code
 ...
clEnqueueReadBuffer(queue, buffer_a, 0, 0, SIZE * SIZEOF_REAL, a, NULL, NULL);

present_or_copyin(a)

buffer_a = clCreateBuffer(context, CL_MEM_READ_WRITE, SIZE * SIZEOF_REAL, NULL, &err);
 ...
 if (!allocated_data_contains("a"))
 {
clEnqueueWriteBuffer(queue, buffer_a, 0, 0, SIZE * SIZEOF_REAL, a, NULL, NULL);
   allocated_data_add("a");
 }
 ...
 // kernel build and run code

present_or_copyout(a)

 void *target_a = calloc(SIZE * SIZEOF_REAL);
buffer_a = clCreateBuffer(context, CL_MEM_READ_WRITE, SIZE * SIZEOF_REAL, NULL, &err);
 ...
 if (!allocated_data_contains("a"))
 {
clEnqueueWriteBuffer(queue, buffer_a, 0, 0, SIZE * SIZEOF_REAL, a, NULL, NULL);
   allocated_data_add("a");
 }
 free(target_a);
 ...
 // kernel build and run code
 ...
clEnqueueReadBuffer(queue, buffer_a, 0, 0, SIZE * SIZEOF_REAL, a, NULL, NULL);

present_or_create(a)

 void *target_a = calloc(SIZE * SIZEOF_REAL);
buffer_a = clCreateBuffer(context, CL_MEM_READ_WRITE, SIZE * SIZEOF_REAL, NULL, &err);
 ...
 if (!allocated_data_contains("a"))
 {
clEnqueueWriteBuffer(queue, buffer_a, 0, 0, SIZE * SIZEOF_REAL, a, NULL, NULL);
   allocated_data_add("a");
 }
 free(target_a);
 ...
 // kernel build and run code
 ...


5 Host_data
When we have some library that aware of GPU, we can use pointers to device data in calls to that library:

 !$acc data copy(A)
 !$acc parallel loop
 do i=1,N
 enddo
 !$acc end parallel loop
 ! call function that requires device pointer
 !$acc host_data use_device(A)
 call do_something_on_device(A)
 !$acc end host_data
 !$acc parallel loop
 do i=1,N
 enddo
 !$acc end parallel loop
 !$acc end data


6 Loop
Loop directive describes a loop to be parallelized. Note, that even if the loop is not preceded be the directive and appears within parallel or kernels region, the compiler may add the directive implicitly.

6.1 Collapse
Collapse directive defines how many nested loop must be parallelized. For example, if we have code like

 !$ACC LOOP
 DO i = 1,10
   DO j = 1,10
     a(i,j) = b(i,j)/c(i,j)
   ENDDO
 ENDDO

The compiler implicitly adds collapse(1) and generated kernel is like

 __kernel k(__global double* a,
            __global double* b,
            __global double* c)
 {
   int i = get_global_id(0);
   if (i < 10)
     for (int j = 0; j < 10; j++)
       a[i][j] = b[i][j]/c[i][j];
 }

However, when programmer explicitly adds collapse(2) the compiler generates the following kernel

 __kernel k(__global double* a,
            __global double* b,
            __global double* c)
 {
   int i = get_global_id(0);
   if (i < 100)
     a[i] = b[i]/c[i];
 }


6.2 Gang, vector and worker
Specify type of parallelism. Across gangs, vectors or workers.


6.3 Seq and independent
Seq clause means "execute sequentially", while independent tells that iterations are data independent. The compiler uses seq when it cannot parallelize the loop.


6.4 Private and reduction
These clauses are resemble to parallel's clauses.


7 Cache
Cache directive defines the variables that must be cached by kernel. For example, if we have code like

 !$ACC LOOP
 DO i = 1, 10
   !$ACC CACHE(d)
   a(i) = d * b(i)
 ENDDO

the compiler generates a kernel like

 __kernel void k(__global double *a,
                 __global double *b,
                 __global double *d)
 {
   double cache_d;
   cache_d = *d;
   for (int i = 0; i < 64; i++)
     a[i] = cache_d*b[i];
 }


8 Update
Update directive looks like data clauses. host is like copyout, device - copyin. if and async clauses are resemble to parallel clauses.

Here is use case for update directive:

 ! the data directive causes array A resides
 ! on device through all region
 !$acc data copy(A)
 !$acc parallel loop
  do i=1,N
   A(i) = ...
  enddo
 !$acc end parallel loop
 ! however in the middle of region
 ! we need to do something on host
 ! with this array,
 ! so we update host copy
 !$acc update host(A)
 call do_something_on_host(A)
 ! after that we need to refresh
 ! device copy of the array
 !$acc update device(A)
 !$acc parallel loop
  do i=1,N
  enddo
  !$acc end parallel loop
 !$acc end data


I'd really appreciated if you provide some comments on it.

--
Regards, Ilmir Usmanov

Reply via email to