RFC: OpenACC to OpenCL translation

Ilmir Usmanov Tue, 01 Oct 2013 06:45:36 -0700

Hello all,

Recently, it was created branch for OpenACC(http://gcc.gnu.org/ml/gcc/2013-09/msg00235.html)

As had been promised, I provide design notes, that describe ourunderstanding of transformations of OpenACC constructs to OpenCL ones.That's only the design draft and may be changed in future.



OpenACC to OpenCL transformation

Contents
1 Overview
2 Parallel
2.1 If
2.2 Async
2.3 Num_gangs and num_workers
2.4 Vector_length
2.5 Private and firstprivate
2.6 Reduction
3 Kernels
3.1 If and async
3.2 Num_gangs, num_workers and vector_length
3.3 Several statements
4 Data
4.1 If
4.2 Copy and create
4.3 Present
5 Host_data
6 Loop
6.1 Collapse
6.2 Gang, vector and worker
6.3 Seq and independent
6.4 Private and reduction
7 Cache
8 Update


1 Overview

This page describes OpenCL function call sequences related to eachOpenACC directive or clause. The content is only design draft.



2 Parallel

Parallel construct says to the compiler that the whole region should berun on the target in parallel. It does not mean that any loop from theregion must be parallelizable.


Let the code to compile is like the following.

 INTEGER :: i
 REAL :: a(SIZE), b(SIZE), c(SIZE)

 !$ACC PARALLEL
 DO i = 1, size
   c(i) = a(i) / b(i)
 ENDDO
 !$ACC END PARALLEL

Where SIZE in smaller than workgroup (gang) size. Since variables arenot declared in data clauses, the compiler must add present_or_copy(a,b, c) clause (however, PGI compiler generates present_or_copyin(a, b)and present_or_copyout(c) clauses, since there is no reading from c andstoring to a, b). Also it must insert an implicit barrier to the end ofthe parallel region since async clause does not present.

Generated OpenCL call sequence is like the following:

 // ...
 // Initialization
 // ...

buffer_a = clCreateBuffer(context, CL_MEM_READ_WRITE, SIZE *SIZEOF_REAL, NULL, &err);buffer_b = clCreateBuffer(context, CL_MEM_READ_WRITE, SIZE *SIZEOF_REAL, NULL, &err);buffer_c = clCreateBuffer(context, CL_MEM_READ_WRITE, SIZE *SIZEOF_REAL, NULL, &err);


 // Checking copyin presence
 // ...

 // Write data to the target

clEnqueueWriteBuffer(queue, buffer_a, 0, 0, SIZE * SIZEOF_REAL, a,NULL, NULL);clEnqueueWriteBuffer(queue, buffer_b, 0, 0, SIZE * SIZEOF_REAL, b,NULL, NULL);clEnqueueWriteBuffer(queue, buffer_c, 0, 0, SIZE * SIZEOF_REAL, c,NULL, NULL);


 // Building kernel
 // ...

 // Enqueuue the kernel
 clSetKernelArg(kernel, 0, SIZE * SIZEOF_REAL, buffer_a);
 clSetKernelArg(kernel, 1, SIZE * SIZEOF_REAL, buffer_b);
 clSetKernelArg(kernel, 2, SIZE * SIZEOF_REAL, buffer_c);
 global_size = SIZE;
 local_size = SIZE;

clEnqueueNDRangeKernel(queue, kernel, 1, &global_size, &local_size, 0,NULL, &event);


 // Implicit barrier
 clWaitForEvent(1, &event);

 // Checking copyout presence
 // ...

 // Read data from the target

clEnqueueReadBuffer(queue, buffer_a, 0, 0, SIZE * SIZEOF_REAL, a,NULL, NULL);clEnqueueReadBuffer(queue, buffer_b, 0, 0, SIZE * SIZEOF_REAL, a,NULL, NULL);clEnqueueReadBuffer(queue, buffer_c, 0, 0, SIZE * SIZEOF_REAL, a,NULL, NULL);

Note: the compiler must use events instead of barriers to wait for theexecution completion, since OpenCL barriers are used to synchronize onlytarget data.

Also the compiler generates OpenCL kernel. Since the kernel is generateddirectly from GIMPLE, it is quite low-level. The logic of generatedkernel is like the following


 __kernel k(__global double* a,
            __global double* b,
            __global double* c)
 {
   int i = get_global_id(0);
   if (i < SIZE)
     c[i] = a[i]/b[i];
 }

Note, that the compiler may parallelize the loop, in other words, itinserts a loop directive right before the loop. When the compilerencounters several parallelizable loops in a single parallel region, itmay insert loop before each of them.

When the region does not contain any OpenACC directive, the compiler maygenerate a single kernel for whole region.



2.1 If

When the compiler encounters if clause (in other words, the statement islike !$ACC PARALLEL IF (condition)), it must generate two types ofregion - OpenCL and native. The OpenCL part is like to parallel one.While the other is just compiled to native code.



2.2 Async

When async presents in the directive, the compiler must not add barrierat the end of the region. Therefore, the host code is like the parallelone without clWaitForEvent(event) statement:clEnqueueNDRangeKernel(queue, kernel, 1, &global_size, &local_size, 0,NULL, NULL);When the parameter of async clause presents, it determines an explicitidentifier of an event to wait for by wait directive oracc_async_wait_all runtime routine.

For example, if we have code like:

 !$ACC PARALLEL ASYNC(0)
 ...
 !$ACC PARALLEL ASYNC(1)
 ...
 !$ACC WAIT(1)
 !$ACC WAIT(0)

the generated code will be like:

clEnqueueNDRangeKernel(queue, kernel0, 1, &global_size, &local_size,0, NULL, &event[0]);

...

clEnqueueNDRangeKernel(queue, kernel1, 1, &global_size, &local_size,0, NULL, &event[1]);

 ...
 clWaitForEvent(&event[1]);
 clWaitForEvent(&event[0]);

where event is a massive of cl_events.


2.3 Num_gangs and num_workers

Num_gangs clause determines number of target work groups, whilenum_workers defines number of OpenCL work items in each work group. So,when num_workers presents, the local_work_size parameter ofclEnqueueNDRangeKernel function is exactly num_workers parameter.


 global_size = SIZE;
 local_size = num_workers;

clEnqueueNDRangeKernel(queue, kernel, 1, &global_size, &local_size, 0,NULL, &event);

Otherwise, number of work items is equal to (SIZE-1)/num_gangs+1. Andthe statement to enqueue kernel is like the following.


 global_size = SIZE;
 local_size = (SIZE-1)/num_gangs+1;

clEnqueueNDRangeKernel(queue, kernel, 1, &global_size, &local_size, 0,NULL, &event);

Note, that when num_workers is not defined, it is equal toSIZE/num_gangs and vice versa.



2.4 Vector_length

Vector_length clause defines the length of vector in vector operations.For example, if the parameter of the clause is 4, the kernel usesdouble4 type instead of double. Therefore, the generated kernel is like:


 __kernel k(__global double4* a,
            __global double4* b,
            __global double4* c)
 {
   int i = get_global_id(0);
   if (i < SIZE/4)
     c[i] = a[i]/b[i];
 }


2.5 Private and firstprivate

Private clause defines the data, that is private for each work group. Itmeans the compiler must declare this data as __local. So, if we havecode like:


 REAL :: d(SIZE)
 !$ACC PARALLEL PRIVATE(d)

the generated kernel is like

 __kernel k(__global double* a,
            __global double* b,
            __global double* c)
 {
   __local double d[SIZE];
   int i = get_global_id(0);
   if (i < SIZE)
     c[i] = a[i]/b[i];
 }

Note, that no data is written to or read from the target. Therefore, thehost code is not changed.

Unlike private, firstprivate specifies that the local data must beinitialized. Therefore, one must to write the data to the target.

buffer_d = clCreateBuffer(context, CL_MEM_READ_WRITE, SIZE *SIZEOF_REAL, NULL, &err);

...

clEnqueueWriteBuffer(queue, buffer_d, 0, 0, SIZE * SIZEOF_REAL, c,NULL, NULL);

 ...
 clSetKernelArg(kernel, 3, SIZE * SIZEOF_REAL, buffer_d);

Also, the generated kernel is changed, it performs copying of the datafrom global to local memory.


 __kernel k(__global double* a,
            __global double* b,
            __global double* c,
            __global double* d)
 {
   __local double local_d[SIZE];
   int i = get_global_id(0);
   if (i == 0)
   {
     for (int i = 0; i < SIZE; i++)
       local_d[i] = d[i];
     barrier(CLK_LOCAL_MEM_FENCE);
   }
   if (i < SIZE)
     c[i] = a[i]/b[i];
 }


2.6 Reduction

To show the generated host and target code for reduction clause we mustchange the OpenACC code:


 INTEGER :: i
 REAL :: a(SIZE), b(SIZE), c(SIZE), sum

 !$ACC PARALLEL REDUCTION(+:sum)
 DO i = 1, size
   c(i) = a(i) / b(i)
   sum = sum + c(i)
 ENDDO
 !$ACC END PARALLEL

In addition to kernel that performs division the compiler will generatea reduction kernel (note: this is only design proof of concept):


 __kernel void reduce(__global float *c)
 {
   int i = get_global_id(0);
   int size = get_global_size(0);
   if (i < size/2)
     c[i] += c[size/2 + i];
 }

This kernel splits the c massive into two ones and reduces it into firsthalf. Since we need the sum of whole massive, the host program mustexecute the kernel log(SIZE) times. Therefore, the host program is likethe following.


 for (i = SIZE; i > 0; i = i / 2)
 {
   clSetKernelArg(reduction_kernel, 0, i, buffer_c);
   global_size = i;
   local_size = i;

clEnqueueNDRangeKernel(queue, reduction_kernel, 1, NULL,global_size, local_size, 0, NULL, &event);

   clEnqueueBarrier(queue);
 }
 clEnqueueReadBuffer(queue, buffer_c, 0, 0, 8, &sum, 1, &event, NULL);

Note the barrier to synchronize the massive's data and that we read onlyfirst variable from the massive.



3 Kernels

In spite of parallel, kernels construct does not have many clauses. Inthis case, the compiler must impicitly add them to parallelize theexecution of the region. This section provides the OpenCL code to addthem, to enqueue the kernels, which are generated from the region and toexecute them in the order.



3.1 If and async

These clauses are similar to parallel clauses, and the generated code isthe same.



3.2 Num_gangs, num_workers and vector_length
Assume that the OpenACC program is similar to parallel's one.

 INTEGER :: i
 REAL :: a(SIZE), b(SIZE), c(SIZE)

 !$ACC KERNELS
 DO i = 1, size
   c(i) = a(i) / b(i)
 ENDDO
 !$ACC END KERNELS

Since there are no num_gangs, num_workers and vector_length in kernelsconstruct, the compiler must choose their appropriate values. To performthis the compiler can call:

vector_length

 cl_uint double_width;

clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE,sizeof(double_width), double_width, NULL);


num_workers

 size_t num_workers;

clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,sizeof(num_workers), num_workers, NULL);


num_gangs

 int num_gangs = SIZE/num_workers;


3.3 Several statements

If the region contains several statements the compiler may split theminto several kernels. For example, if the region has two loop nests:


 INTEGER :: i
 REAL :: a(SIZE), b(SIZE), c(SIZE), d(SIZE)

 !$ACC KERNELS
 DO i = 1, size
   c(i) = a(i) / b(i)
 ENDDO
 DO i = 1, size
   d(i) = a(size + 1 - i) / b(i)
 ENDDO
 !$ACC END KERNELS

the compiler generates two kernels, for each nest.

 __kernel k1(__global double* a,
             __global double* b,
             __global double* c)
 {
   int i = get_global_id(0);
   if (i < SIZE)
     c[i] = a[i]/b[i];
 }
 __kernel k2(__global double* a,
             __global double* b,
             __global double* d)
 {
   int i = get_global_id(0);
   if (i < SIZE)
     d[i] = a[SIZE - i]/b[i];
 }


4 Data

Data construct defines the data to write to the target at the beginningof the region and to read from it at the end of the region. Note, thatthe compiler must not generate any OpenCL kernel for this construct.



4.1 If
If clause is similar to parallel if.


4.2 Copy and create

Copy data clauses declare the data to copy between host and target.copyin defines the data to copy from the host to the target at thebeginning of the region, copyout - from the target at the end, copy issum of copyin and copyout. Finally, create defines the data to allocateat the target, but not to copy.

So, the code for these clauses is like the following (allocated_data_addfunction described below).


copyin(a)