Hi Thomas,

Thanks for your detailed response! I've updated the proposal based on your
feedback; please take another look when you have a chance.

Project Goals and Tasks

GCC currently only partially supports the features specified in OpenACC
2.6. This project aims to enhance GCC's OpenACC support in the following
areas:

*1. OpenACC **acc_memcpy_device** Runtime API Routine*

The acc_memcpy_device routine is currently missing in GCC's OpenACC runtime
implementation. According to the specification, this routine copies a
specified number of bytes from one device address (data_dev_src) to another
device address (data_dev_dest). Both addresses must reside in the current
device’s memory. There is also an asynchronous variant that performs the
data transfer on a specified async queue (async_arg). The routine must
handle error cases such as null pointers and invalid async argument values.
This function shares a similar implementation pattern with
acc_memcpy_to_device and acc_memcpy_from_device, which transfer data
between host and device memory.
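
For reference, here are the prototypes as I read them from the
specification (to be added to libgomp/openacc.h; the Fortran interfaces
would mirror those of acc_memcpy_to_device):

/* Both pointers must refer to memory of the current device.  */
void acc_memcpy_device (void *data_dev_dest, void *data_dev_src,
                        size_t bytes);
void acc_memcpy_device_async (void *data_dev_dest, void *data_dev_src,
                              size_t bytes, int async_arg);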

Implementation will mainly involve modifying the following files:

   - libgomp/libgomp.map
   - libgomp/openacc.h
   - libgomp/openacc_lib.h
   - libgomp/openacc.f90
   - libgomp/oacc-mem.c
   - libgomp/libgomp.h
   - libgomp/target.c
   - libgomp/plugin/plugin-nvptx.c
   - libgomp/plugin/plugin-gcn.c

Existing functions such as memcpy_tofrom_device, gomp_copy_dev2host,
gomp_device_copy, and gomp_device_copy_async were primarily designed for
acc_memcpy_to_device and acc_memcpy_from_device, which handle host-device
transfers.

At the device-specific level, these functions delegate low-level memory
operations to libgomp plugins (e.g., libgomp/plugin/plugin-nvptx.c for
NVIDIA and libgomp/plugin/plugin-gcn.c for AMD), which are dynamically
loaded from shared object (.so) files and use target runtime APIs (like
CUDA or ROCm) to allocate memory, launch kernels, and manage transfers.

For acc_memcpy_device, which handles device-to-device transfers, we will
need to design analogous logic; a rough sketch of one possible approach
follows. Further investigation is needed to structure and implement this
functionality effectively.
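
A very rough sketch of the generic libgomp side, assuming the existing
dev2dev plugin hook (GOMP_OFFLOAD_dev2dev, already used for OpenMP
device-to-device copies) can be reused; it relies on libgomp-internal
declarations (libgomp.h, oacc-int.h), and the exact accessor names may
need adjusting:

/* Sketch for libgomp/oacc-mem.c -- not actual GCC code.  */
void
acc_memcpy_device (void *data_dev_dest, void *data_dev_src, size_t bytes)
{
  if (bytes == 0)
    return;
  if (data_dev_dest == NULL || data_dev_src == NULL)
    gomp_fatal ("invalid NULL pointer in %s", __FUNCTION__);

  goacc_lazy_initialize ();
  struct goacc_thread *thr = goacc_thread ();
  struct gomp_device_descr *dev = thr->dev;

  /* Shared-memory devices (e.g. host fallback) can use a plain memcpy.  */
  if (dev->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
    {
      memcpy (data_dev_dest, data_dev_src, bytes);
      return;
    }

  /* Delegate the copy to the device plugin (cuMemcpyDtoD for nvptx, an
     HSA copy for gcn), analogous to how the host<->device routines
     delegate to the host2dev/dev2host hooks.  */
  if (!dev->dev2dev_func (dev->target_id, data_dev_dest, data_dev_src, bytes))
    gomp_fatal ("error in %s", __FUNCTION__);
}

The async variant would additionally look up the goacc_asyncqueue for
async_arg and use the corresponding asynchronous plugin entry point.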

*2. Support for **init**, **shutdown**, and **set** Directives*

These directives are currently unsupported at the front-end level in GCC,
even though their corresponding runtime APIs—acc_init, acc_shutdown,
acc_set_device_type, and their async queue variants—are implemented. The
goal here is to add parsing support in the front end to map these
directives to the appropriate built-in functions. In GCC, front ends map
OpenACC directives to BUILT_IN_GOACC_* entries defined in
gcc/omp-builtins.def, and the back end expands these into runtime API
calls. This task involves analyzing and extending front-end source files,
taking inspiration from the implementation of the wait directive. Relevant
files include:

   - gcc/c-family/c-omp.cc
   - gcc/c/c-parser.cc
   - gcc/cp/parser.cc
   - gcc/fortran/trans-openmp.cc
   - gcc/omp-builtins.def
   - gcc/omp-oacc-kernels-decompose.cc
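
To make the intended mapping concrete, here is a small usage sketch; the
runtime calls in the comments are what I expect the new built-ins to
expand to, not existing GCC behaviour:

#include <openacc.h>

int
main (void)
{
  #pragma acc init device_type (nvidia)     /* ~ acc_init (acc_device_nvidia) */
  #pragma acc set device_num (0)            /* ~ acc_set_device_num (0, ...)  */

  /* ... offloaded compute regions ... */

  #pragma acc shutdown device_type (nvidia) /* ~ acc_shutdown (acc_device_nvidia) */
  return 0;
}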

*3. Make the OpenACC **cache** Directive Actually Do Something*

Currently, the cache directive in OpenACC is parsed at the front end, but
not used for any optimization purposes, such as data prefetching or moving
data to low-latency memory (e.g., L1/L2/L3 cache or GPU cache [5]). We can
leverage existing prefetch support in GCC, such as the middle-end aprefetch
pass, which analyzes nested loop structures and inserts __builtin_prefetch
calls in GIMPLE for the innermost loop. These are later expanded by the
backend expand pass to target-specific prefetch instructions based on the
ISA. On CPUs, __builtin_prefetch is typically mapped to prefetch
instructions in the CPU ISA. For OpenACC compilation (under -fopenacc), we
can likewise define a customized __builtin_prefetch_openacc built-in and
write new RTL templates that map it to GPU-specific prefetch instructions,
where such instructions exist for the target architecture; GPU ISA support
is discussed at the end of this section.

To be more specific about how we can leverage existing infrastructure: when
the OACC_CACHE directive is lowered to GIMPLE via gimplify_oacc_cache, we
can insert __builtin_prefetch_openacc calls at appropriate locations using
gimple_build_call and gsi_insert_before in the aprefetch pass located in
tree-ssa-loop-prefetch.cc. These built-ins will then be picked up by the
backend RTL expansion pass. However, to the best of my knowledge, there is
currently no backend support for expanding prefetch instructions for GPU
targets. Therefore, additional backend work is needed to implement the
necessary RTL patterns or instruction selection hooks for GPU architectures.
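
As a hedged illustration of the intended transformation, consider a simple
loop; __builtin_prefetch_openacc is the proposed built-in from above, and
DIST is a placeholder for a target-dependent prefetch distance:

/* User code: the cache directive names the data to keep in fast memory.  */
void
scale (int n, const float *a, float *b)
{
  #pragma acc parallel loop copyin(a[0:n]) copyout(b[0:n])
  for (int i = 0; i < n; i++)
    {
      #pragma acc cache (a[i:8])
      b[i] = a[i] * 2.0f;
    }
}

/* Intended lowering (conceptual): gimplify_oacc_cache records the cached
   reference, and an aprefetch-style pass inserts calls such as
       __builtin_prefetch_openacc (&a[i + DIST]);
   into the innermost loop body, which the backend then expands to GPU
   prefetch instructions (or drops if the target has none).  */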

Aside from the aprefetch pass enabled by the -fprefetch-loop-arrays
option, GCC also provides several tuning parameters via --param, such as
prefetch-latency, simultaneous-prefetches, and sched-autopref-queue-depth [7],
to control how aggressive or conservative the prefetching should be. All of
these parameters influence the aprefetch pass's analysis of loop structure,
so they generally require no special handling from us. However, they may
restrict how many prefetch instructions can be issued effectively. These
limitations stem from hardware resource models. Since we're manually
issuing software prefetch instructions, we can avoid complex modeling for
now and focus on building the basic infrastructure. This differs from
hardware prefetching, which is handled automatically by the processor and
may interact with software prefetching. While understanding this
interaction is useful, it is beyond the scope of this project and can be
considered in future work.

Both AMD and NVIDIA GPUs support ISA-level prefetching, which simplifies
implementing OpenACC’s cache directive. For AMD CDNA GPUs [8], prefetch
behavior can be achieved using load instructions like BUFFER_LOAD_DWORD
with cache-control flags (e.g., SC1, SC0, NT). Similarly, NVIDIA GPUs
support dedicated prefetch instructions in the PTX ISA [9], such as
prefetch.global.L1 [a] and prefetchu.L1 [a], which allow prefetching to
various cache levels and memory spaces. These capabilities enable the GCC
RTL backend to map
__builtin_prefetch_openacc to the appropriate target instructions or
intrinsics, depending on the GPU architecture. However, if the available
GPU instructions are insufficient to cover all prefetching use cases, we
should consider falling back to runtime API routines from libgomp, such as
acc_memcpy_device, to manually manage data movement.

*4. OpenACC **bind** Clause Support*

This section introduces what the bind clause does in OpenACC, analyzes how
the existing GCC infrastructure supports routine directives, and proposes a
path for implementing full bind clause support.

The bind clause in OpenACC lets you link a device function to a host
function name using the routine directive. In practice, you define one
function for the host and a separate one for the device; when the function
is called on the device inside an OpenACC compute region, the device
version named by bind is used instead. For example:

/* 'sum' is declared as an OpenACC routine; bind(sum1) means that calls to
   sum in device code use the device version of sum1 instead.  */
#pragma acc routine worker bind(sum1)
int sum (int, float *);

/* Host implementation of sum.  */
int sum (int n, float *A)
{
        int i;
        float s = 0.0f;
        for (i = 0; i < n; i++)
                s = s + A[i];
        return s;
}

/* Device routine that is actually called from OpenACC compute regions.  */
#pragma acc routine worker
int sum1 (int n, float *A)
{
        int i;
        float s = 0.0f;
        #pragma acc loop vector reduction(+:s)
        for (i = 0; i < n; i++)
                s = s + A[i] + 2;
        return s;
}

By default, when a function is declared with #pragma acc routine, the
compiler generates two versions of the function: one for the host and one
for the target device. The bind clause modifies this behavior by directing
the compiler to use the device version of an alternative function—sum1 in
this case—when the function is called within an OpenACC compute region.

This means that:

   - If the function sum is called from host code, the host implementation
   of sum will be invoked.
   - If the same function sum is called from within an OpenACC parallel
   region, the device code for sum1 will be used instead.

This mechanism allows for greater flexibility in mapping host-visible API
functions to optimized or device-specific versions, facilitating
performance tuning or usage of hand-optimized GPU code.
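
Concretely, with the declarations above (array size and data clauses are
only illustrative):

void
example (void)
{
  float A[1000] = { 0 };
  int r_host, r_dev;

  /* Host call: the host implementation of sum runs.  */
  r_host = sum (1000, A);

  /* Device call inside a compute region: the device code generated for
     sum1 runs instead, because of bind(sum1).  */
  #pragma acc parallel copyin(A[0:1000]) copyout(r_dev)
  r_dev = sum (1000, A);
}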

Additionally, the bind clause can reference a CUDA device function,
enabling integration with low-level CUDA kernels or library routines when
compiling OpenACC code for NVIDIA GPUs.

Related GCC Files and Functions:

   - gcc/omp-general.cc
      - oacc_build_routine_dims: Generates attributes for parallelism level.
      - oacc_verify_routine_clauses: Verifies and normalizes routine
      clauses.
   - gcc/c/c-parser.cc: c_finish_oacc_routine calls
   oacc_verify_routine_clauses.
   - gcc/cp/parser.cc: cp_finalize_oacc_routine calls
   oacc_verify_routine_clauses.
   - gcc/fortran/f95-lang.cc and trans-decl.cc: Handle OpenACC routine
   clauses via attribute logic.

To implement support for the bind clause in GCC:

   - Clause Parsing Integration: Extend the parser to recognize and store
   the bind clause during front-end processing.
   - Function Binding Mechanism: Add logic to map the host function name to
   the target device function as specified in the bind clause.
   - Host-Device Code Generation: Study how the host version of the
   function is generated and invoked. GCC uses two compilers during
   offloading, one for the host and one for the accelerator, so the host
   and device versions of a routine end up as separate functions.
   - Linking Investigation: Investigate whether linking affects this
   mapping. Specifically, study the generation and role of:
      - .gnu.lto_.offload_table: IR for declarations from the host compiler.
      - .gnu.offload_lto_.offload_table: IR for declarations from the
      accelerator compiler.

*5. OpenACC **device_type** Clause*

The device_type clause in OpenACC enables developers to fine-tune execution
behavior per target architecture (e.g., NVIDIA or AMD GPUs) within the same
directive. The allowed clauses following device_type differ by directive:
for compute constructs (parallel, kernels), valid clauses include async,
wait, num_gangs, num_workers, and vector_length; for loop constructs, valid
clauses include gang, worker, vector, seq, independent, auto, and tile [1].
The implementation should also check that the clauses following device_type
are legal for the enclosing directive. If clauses behave similarly across
directives, implementation is simplified, but the specification does not
confirm this uniformly, so a manual analysis is needed to determine how
consistently each clause behaves across constructs.
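
For example (essentially the example from Thomas's reply quoted below),
the same construct is tuned differently per offload target; the values
here are only illustrative:

void
add (int n, float *a, const float *b, const float *c)
{
  /* Nvidia GPU offloading uses num_gangs(100); AMD GPU offloading uses
     num_gangs(50).  */
  #pragma acc parallel loop copyin(b[0:n], c[0:n]) copyout(a[0:n]) \
                            device_type(nvidia) num_gangs(100) \
                            device_type(radeon) num_gangs(50)
  for (int i = 0; i < n; i++)
    a[i] = b[i] + c[i];
}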

Our goal is to design an implementation strategy for device_type that
generalizes across different constructs, integrating with GCC's modular
clause handling system.

GCC's current implementation of OpenACC clauses is highly modular: many
constructs share clause parsing and offloading logic. By tracing the
ChangeLog of the gomp-4_0-branch and analyzing existing clauses like
num_gangs, we can identify which macros, clause-parsing logic, and backend
interfaces need to be added so that the new clause can record per-device
values. For device_type, the clause would be parsed in the C/C++ and
Fortran front ends, lowered through the AST, GENERIC, and GIMPLE, with the
values for each device type carried along until the backend determines the
target. At that point, clauses for other device types can be discarded
during expansion. After this filtering, we simply follow the existing logic
for setting clause values, as seen elsewhere in the OpenACC implementation.

The default values for device-related clauses are not hardcoded. They can
be explicitly set by users in directives, overridden by environment
variables such as GOMP_OPENACC_DIM, or computed dynamically at runtime
[11]. For example, in plugin-nvptx.c, CUDA driver APIs (like
cuDeviceGetAttribute) are used to query hardware capabilities—such as the
number of registers per multiprocessor, threads per block, and shared
memory. These values are then used to compute optimal defaults for
execution configuration. By tracing patches that fetch and set these
values—such as how GOMP_OPENACC_DIM is parsed and applied—we gain insights
into how values specified in directives are preserved and utilized
throughout the compilation process.
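
As a hedged illustration of the kind of query involved (the surrounding
plugin code and error handling are omitted, and the helper name is made up,
but the driver API calls are the standard CUDA ones):

#include <cuda.h>

/* Query per-device limits of the kind plugin-nvptx.c uses when computing
   default launch dimensions.  */
static void
query_default_dims (CUdevice dev, int *threads_per_block, int *num_sms)
{
  cuDeviceGetAttribute (threads_per_block,
                        CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK, dev);
  cuDeviceGetAttribute (num_sms,
                        CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, dev);
}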

References

[1] OpenACC Specification: https://www.openacc.org/specification

[2] OpenACC Host Compiler Compilation Process:
https://gcc.gnu.org/wiki/Offloading#Compilation_process

[3] Improving OpenACC kernels support in GCC:
https://gcc.gnu.org/wiki/cauldron2017?action=AttachFile&do=get&target=OpenACC+kernels.pdf

[4] Issue with acc_memcpy_device:
https://forums.developer.nvidia.com/t/issue-with-acc-memcpy-device/135977

[5] NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf:
https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-product-literature/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf

[6] Data Prefetch Support: https://gcc.gnu.org/projects/prefetch.html

[7] Options That Control Optimization:
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

[8] "AMD Instinct MI200" Instruction Set Architecture:
https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/instinct-mi200-cdna2-instruction-set-architecture.pdf

[9] NVIDIA PTX ISA: https://docs.nvidia.com/cuda/pdf/ptx_isa_8.5.pdf

[10] [openacc] Document GOMP_OPENACC_DIM:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85129

[11] [gomp4] adjust num_gangs and add a diagnostic for unsupported
num_workers: https://gcc.gnu.org/legacy-ml/gcc-patches/2017-02/msg00834.html

[12] OpenACC routine bind:
https://forums.developer.nvidia.com/t/openacc-routine-bind/133968


Best regards,

Chenlu

On Thu, Apr 3, 2025 at 3:17 AM Thomas Schwinge <tschwi...@baylibre.com>
wrote:

> Hi Chenlu!
>
> On 2025-03-27T22:05:02+1100, Zhang lv via Gcc <gcc@gcc.gnu.org> wrote:
> >  Hi here,
>
> Welcome to GCC!
>
> > I found digging into OpenACC meaningful. It's a late start for a GSoC
> > proposal
>
> ..., but not yet too late!  :-)
>
> > and any suggestions from the community are appreciated! Feel free
> > to comment on any part of it. To save time for readers, I've outlined my
> > key understandings here—expansion and polishing are still needed.
> >
> > I'm not sure whether my understanding of each task is correct, especially
> > for the cache directive and device_type clause. Here is my current
> > understanding:
> >
> > *3. Make the OpenACC **cache** Directive Actually Do Something*
> >
> > Currently, the cache directive in OpenACC is parsed at the front end, but
> > not used for any optimization purposes, such as data prefetching or
> moving
> > data to low-latency memory (e.g., L1/L2/L3 cache or GPU cache [5]).
> >
> > *TODO:* My current understanding is that after the OACC_CACHE directive
> is
> > lowered to GIMPLE via gimplify_oacc_cache, a new tree-SSA optimization
> pass
> > could be added. This pass might be similar to the existing aprefetch
> pass,
> > or OpenACC prefetch logic could be integrated into it.
>
> ..., or even just synthesizing specific (GIMPLE?) constructs, so that the
> existing pass can do its work -- in case that the pass' existing support
> maps to (at least specific instances of) the functionality provided by
> the respective offloading back ends.  That'll need more research; I don't
> know yet, either.
>
> > The goal may be emitting prefetch instructions by inserting suitable
> > built-in functions and relying on the backend to map them to runtime API
> > calls or device-specific instructions through RTL templates.
> >
> > However, several questions remain:
> >
> >    - Since OpenACC supports both accelerators (e.g., GPUs) and multicore
> >    CPUs, should we handle both cases?
> >       - For CPUs, we can refer to each vendor's ISA (e.g., x86_64, Arm)
> to
> >       decide which prefetch instructions to generate.
>
> In general that's right, but note that currently GCC does not yet
> implement OpenACC support for multicore CPU, but only for AMD and Nvidia
> GPUs, or single-threaded host-fallback execution (with 'if(false)', or if
> no device is available).  But then, in addition to the code offloading
> devices, data prefetching etc. conceptually also applies to the host code
> generation.  (..., but that need not be the focus of the work, at this
> time.)
>
> >       - For GPUs, are we expected to use prefetch instructions from GPU
> ISA
> >       or Should we manual use runtime API routines like
> acc_memcpy_device to
> >       manage data?
>
> You're right that there may also be cases where we may call specific
> functions of the respective GPU driver/library, to mark up memory regions
> in specific ways, etc., but for OpenACC 'cache', my first idea would've
> been to use specific individual GPU instructions, or instruction
> variants/flags for existing instructions, as applicable.
>
> >    - Additional considerations include choosing a suitable prefetch
> >    distance, which may differ by device type or architecture.
>
> Right.
>
> Understand, that at a specific point in the compilation pipeline, the
> offloading code steam gets split off of the host code stream.  Until
> then, we'll have to keep a generic (GIMPLE) representation, and
> afterwards, we know if we're host vs. AMD GPU vs. Nvidia GPU, and which
> specific architecture we're compiling for, and can then map to specific
> IR representations, and finally specific GPU instructions etc.
>
> > *5. OpenACC **device_type** Clause*
> >
> > *TODO:* Is the device_type clause designed to allow users to manually
> > specify the target platform in source code, rather than via compiler
> > options like -foffload=amdgcn-amdhsa="-march=gfx900"
> >
> > , or compiler building options like--target=nvptx-none?
>
> Yet different; see the OpenACC specification.  High-level: if you have a
> toolchain targeting several offload devices, it allows for tuning OpenACC
> processing *differently* per 'device_type'.  For example (untested, and
> not very clever like this):
>
>     #pragma acc parallel loop device_type(nvidia) num_gangs(100)
> device_type(radeon) num_gangs(50)
>
> Nvidia GPU offloading compilation/execution will use 'num_gangs(100)',
> AMD GPU offloading compilation/execution will use 'num_gangs(50)'.
>
> > My understanding for other task is as follows:
> >
> >
> > *1. OpenACC **acc_memcpy_device** Runtime API Routine*
> >
> > The acc_memcpy_device routine is currently missing in GCC's OpenACC
> runtime
> > implementation. According to the specification, this routine copies a
> > specified number of bytes from one device address (data_dev_src) to
> another
> > device address (data_dev_dest). Both addresses must reside in the current
> > device’s memory. There is also an asynchronous variant that performs the
> > data transfer on a specified async queue (async_arg). The routine must
> > handle error cases such as null pointers and invalid async argument
> values.
> > This function shares a similar implementation pattern with
> > acc_memcpy_to_device and acc_memcpy_from_device, which transfer data
> > between host and device memory.
> >
> > Implementation will mainly involve modifying the following files:
> >
> >    - libgomp/libgomp.map
> >    - libgomp/openacc.h
> >    - libgomp/openacc_lib.h
> >    - libgomp/openacc.f90
> >    - libgomp/oacc-mem.c
> >
> > The existing functions such as memcpy_tofrom_device, gomp_copy_dev2host,
> > and gomp_device_copy were primarily designed for acc_memcpy_to_device and
> > acc_memcpy_from_device, which handle host-device transfers.
>
> That's right, that's the generic code -- but what additional
> functions/files do the existing routines use, to actually implement the
> functionality at the level of each specific device?
>
> > For
> > acc_memcpy_device, which handles device-to-device transfers, we should
> > design a similar logic. Further investigation is needed to structure and
> > implement this functionality effectively.
>
> Right.
>
> > *2. Support for **init**, **shutdown**, and **set** Directives*
> >
> > These directives are currently unsupported at the front-end level in GCC,
> > even though their corresponding runtime APIs—acc_init, acc_shutdown,
> > acc_set_device_type, and their async queue variants—are implemented. The
> > goal here is to add parsing support in the front end to map these
> > directives to the appropriate built-in functions. In GCC, front ends map
> > OpenACC directives to BUILT_IN_GOACC_* entries defined in
> > gcc/omp-builtins.def, and the back end expands these into runtime API
> > calls. This task involves analyzing and extending front-end source files,
> > taking inspiration from the implementation of the wait directive.
> Relevant
> > files include:
> >
> >
> >    - gcc/c-family/c-omp.cc
> >    - gcc/c/c-parser.cc
> >    - gcc/cp/parser.cc
> >    - gcc/fortran/trans-openmp.cc
> >    - gcc/omp-builtins.def
> >    - gcc/omp-oacc-kernels-decompose.cc
>
> Right.
>
> > *4. OpenACC **bind** Clause Support*
> >
> > The bind clause appears in the routine directive
>
> Right.
>
> > and applies at the
> > sequential level of parallelism.
>
> Please look into that, again.
>
> > The following restrictions must be
> > enforced:
> >
> >    - A routine may not bind to a name that already has a visible bind
> >    clause.
> >    - If a procedure has a bind clause on both its declaration and
> >    definition, they must bind to the same name.
> >    - When compiling for multicore host CPUs, any bind clause should be
> >    ignored.
> >    - A bind clause must not appear after a device_type(host) clause.
> >
> > These cases should be carefully validated during semantic analysis. We
> can
> > also use internal control variables (ICV) like default-device-var to
> inform
> > bind behavior.
>
> That'll need some clarification.  What does the 'bind' clause actually
> do, how could this be implemented in the compiler?
>
> > And my understanding about the background is as follows:
> >
> >
> > Introduction
> >
> > OpenACC is a directive-based parallel programming model designed for
> > heterogeneous HPC hardware. However, GCC currently only partially
> supports
> > the features specified in OpenACC 2.6: some directives are not parsed at
> > all, some are parsed at the front end but are not lowered to generate the
> > appropriate runtime API calls, and the runtime library implementation in
> > GCC is also incomplete. This project aims to address these gaps by
> > proposing feasible solutions to enhance GCC’s OpenACC support to more
> > closely align with the official specification.
>
> Right, and we may of course also look into other OpenACC features that
> appeared with later releases of the OpenACC specifications; often (but
> not always) the features are orthogonal, meaning that later versions'
> features may be implemented even if not (all of) the former ones have
> been.
>
>
> Grüße
>  Thomas
>
>
> > Background
> >
> > OpenACC is a parallel programming model for heterogeneous HPC hardware,
> > abstracted into two parts: the host and the attached parallel
> accelerator,
> > such as a GPU. It provides compiler directives (e.g., in C/C++: #pragma
> acc
> > directive-name [clause-list]) that allow users to specify
> compute-intensive
> > regions of a program to be offloaded to an accelerator or executed on
> > multiple host cores under the control of a host thread. The* execution
> > model* is host-directed: the host thread manages memory allocation on the
> > accelerator, initiates data transfers, sends code and arguments to the
> > device, queues tasks, waits for completion, retrieves results, and
> > deallocates memory. A key aspect of OpenACC is its *memory model*:
> > accelerator memory is often separate from host memory, requiring explicit
> > data transfers handled by the OpenACC runtime through underlying system
> > calls such as direct memory access (DMA) transfers. Nowadays, most
> > accelerators include caches, and OpenACC requires the compiler to manage
> > these caches [1].
> >
> > GCC parses code containing OpenACC directives written in C/C++ or Fortran
> > and uses the OpenMP runtime API routines from the libgomp
> library—developed
> > by GCC—to implement the functionality of each directive. At runtime,
> libgomp
> > can look up and launch an offload function when given a target function
> > address [3]. These target functions are linked to libgomp plugins, which
> > are loaded from the standard dynamic linker path. For example, the plugin
> > for Intel MIC devices uses liboffloadmic.so, while the plugin for NVIDIA
> > PTX devices uses libcuda.so [2]. These loaded plugins rely on
> third-party,
> > target-specific libraries to perform low-level interactions with
> > accelerator devices. In short, libgomp is designed to be independent of
> > specific accelerator architectures—it exposes a generic interface and
> > delegates all target-dependent functionality to plugins. These plugins
> are
> > developed collaboratively by the GNU community and hardware vendors.
> >
> >
> >
> > References
> >
> > [1] OpenACC Specification: https://www.openacc.org/specification
> >
> > [2] OpenAcc Host Compiler Compilation Process:
> > https://gcc.gnu.org/wiki/Offloading#Compilation_process
> >
> > [3] Improving OpenACC kernels support in GCC:
> >
> https://gcc.gnu.org/wiki/cauldron2017?action=AttachFile&do=get&target=OpenACC+kernels.pdf
> >
> > [4] Issue with acc_memcpy_device
> >
> https://forums.developer.nvidia.com/t/issue-with-acc-memcpy-device/135977
> >
> > [5] NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf:
> >
> https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-product-literature/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf
> >
> >
> > Best regards,
> >
> > Chenlu
>
