Hi Thomas,

Thanks for your detailed response! I've updated the proposal based on your feedback; please take a look. Thanks a lot!
Project Goals and Tasks

GCC currently only partially supports the features specified in OpenACC 2.6. This project aims to enhance GCC's OpenACC support in the following areas:

*1. OpenACC **acc_memcpy_device** Runtime API Routine*

The acc_memcpy_device routine is currently missing from GCC's OpenACC runtime implementation. According to the specification, this routine copies a specified number of bytes from one device address (data_dev_src) to another device address (data_dev_dest). Both addresses must reside in the current device’s memory. There is also an asynchronous variant that performs the data transfer on a specified async queue (async_arg). The routine must handle error cases such as null pointers and invalid async argument values. This function shares a similar implementation pattern with acc_memcpy_to_device and acc_memcpy_from_device, which transfer data between host and device memory.

Implementation will mainly involve modifying the following files:

- libgomp/libgomp.map
- libgomp/openacc.h
- libgomp/openacc_lib.h
- libgomp/openacc.f90
- libgomp/oacc-mem.c
- libgomp/libgomp.h
- libgomp/target.c
- libgomp/plugin/plugin-nvptx.c
- libgomp/plugin/plugin-gcn.c

The existing functions such as memcpy_tofrom_device, gomp_copy_dev2host, gomp_device_copy, and gomp_device_copy_async were primarily designed for acc_memcpy_to_device and acc_memcpy_from_device, which handle host-device transfers. At the device-specific level, these functions delegate low-level memory operations to libgomp plugins (e.g., libgomp/plugin/plugin-nvptx.c for NVIDIA and libgomp/plugin/plugin-gcn.c for AMD), which are dynamically loaded from shared object (.so) files and use target runtime APIs (such as CUDA or ROCm) to allocate memory, launch kernels, and manage transfers. For acc_memcpy_device, which handles device-to-device transfers, we should design similar logic. Further investigation is needed to structure and implement this functionality effectively.

*2. Support for **init**, **shutdown**, and **set** Directives*

These directives are currently unsupported at the front-end level in GCC, even though their corresponding runtime APIs—acc_init, acc_shutdown, acc_set_device_type, and their async queue variants—are implemented. The goal here is to add parsing support in the front end to map these directives to the appropriate built-in functions. In GCC, front ends map OpenACC directives to BUILT_IN_GOACC_* entries defined in gcc/omp-builtins.def, and the back end expands these into runtime API calls. This task involves analyzing and extending front-end source files, taking inspiration from the implementation of the wait directive. Relevant files include:

- gcc/c-family/c-omp.cc
- gcc/c/c-parser.cc
- gcc/cp/parser.cc
- gcc/fortran/trans-openmp.cc
- gcc/omp-builtins.def
- gcc/omp-oacc-kernels-decompose.cc

*3. Make the OpenACC **cache** Directive Actually Do Something*

Currently, the cache directive in OpenACC is parsed at the front end but not used for any optimization purpose, such as data prefetching or moving data to low-latency memory (e.g., L1/L2/L3 cache or GPU cache [5]). We can leverage existing prefetch support in GCC, such as the middle-end aprefetch pass, which analyzes nested loop structures and inserts __builtin_prefetch calls in GIMPLE for the innermost loop. These are later expanded by the backend expand pass into target-specific prefetch instructions based on the ISA. On CPUs, __builtin_prefetch is typically mapped to prefetch instructions in the CPU ISA.
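As a rough illustration (not GCC's actual output; the array-section size, prefetch distance, and __builtin_prefetch arguments are arbitrary values chosen for this sketch), this is the kind of transformation the cache directive could drive:

/* OpenACC source: request that a[i:16] be kept in the fastest memory
   available for the body of the loop.  */
#pragma acc parallel loop copyin(a[0:n]) copyout(b[0:n])
for (int i = 0; i < n; i++)
  {
#pragma acc cache(a[i:16])
    b[i] = 2.0f * a[i];
  }

/* Conceptually, a prefetching pass could lower this to a loop carrying an
   explicit read prefetch a fixed distance ahead (16 elements here, purely
   for illustration):  */
for (int i = 0; i < n; i++)
  {
    __builtin_prefetch (&a[i + 16], 0 /* read */, 3 /* high temporal locality */);
    b[i] = 2.0f * a[i];
  }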
With -fopenacc enabled and the proper flags set, we can also define a custom __builtin_prefetch_openacc call and write new RTL templates to map prefetch calls to GPU-specific instructions, if such instructions exist for the target architecture; this is discussed at the end of this section. To be more specific about how we can leverage the existing infrastructure: when the OACC_CACHE directive is lowered to GIMPLE via gimplify_oacc_cache, we can insert __builtin_prefetch_openacc calls at appropriate locations using gimple_build_call and gsi_insert_before in the aprefetch pass in tree-ssa-loop-prefetch.cc. These built-ins will then be picked up by the backend RTL expansion pass. However, to the best of my knowledge, there is currently no backend support for expanding prefetch instructions for GPU targets, so additional backend work is needed to implement the necessary RTL patterns or instruction selection hooks for GPU architectures.

Aside from the aprefetch pass enabled by the -fprefetch-loop-arrays option, GCC also provides several tuning parameters via --param, such as prefetch-latency, simultaneous-prefetches, and sched-autopref-queue-depth [7], to control how aggressive or conservative prefetching should be. All of these parameters influence the aprefetch pass's analysis of loop structure, so they generally require no special handling from us. However, they may restrict how many prefetch instructions can be issued effectively; these limitations stem from hardware resource models. Since we are manually issuing software prefetch instructions, we can avoid complex modeling for now and focus on building the basic infrastructure. This differs from hardware prefetching, which is handled automatically by the processor and may interact with software prefetching. While understanding this interaction is useful, it is beyond the scope of this project and can be considered in future work.

Both AMD and NVIDIA GPUs support ISA-level prefetching, which simplifies implementing OpenACC’s cache directive. For AMD CDNA GPUs [8], prefetch behavior can be achieved using load instructions such as BUFFER_LOAD_DWORD with cache-control flags (e.g., SC1, SC0, NT). Similarly, NVIDIA GPUs provide dedicated prefetch instructions in the PTX ISA [9], such as prefetch.global.L1 [a] and prefetchu.L1 [a], which allow prefetching into various cache levels and memory spaces. These capabilities enable the GCC RTL backend to map __builtin_prefetch_openacc to the appropriate target instructions or intrinsics, depending on the GPU architecture. However, if the available GPU instructions are insufficient to cover all prefetching use cases, we should consider falling back to libgomp runtime API routines, such as acc_memcpy_device, to manage data movement manually.

*4. OpenACC **bind** Clause Support*

This section introduces what the bind clause does in OpenACC, analyzes how the existing GCC infrastructure supports the routine directive, and proposes a path for implementing full bind clause support.

The bind clause appears on the routine directive and links a device function to a host function name. In other words, one function is defined for the host and a separate one for the device; when the function is called on the device inside an OpenACC compute region, the device version named in the bind clause is used.
For example:

#pragma acc routine worker bind(sum1)
int sum(int, float *);

int sum(int n, float *A)
{
  int i;
  float s = 0.0f;
  for (i = 0; i < n; i++)
    s = s + A[i];
  return s;
}

#pragma acc routine worker
int sum1(int n, float *A)
{
  int i;
  float s = 0.0f;
#pragma acc loop vector reduction(+:s)
  for (i = 0; i < n; i++)
    s = s + A[i] + 2;
  return s;
}

By default, when a function is declared with #pragma acc routine, the compiler generates two versions of the function: one for the host and one for the target device. The bind clause modifies this behavior by directing the compiler to use the device version of an alternative function—sum1 in this case—when the function is called within an OpenACC compute region. This means that:

- If the function sum is called from host code, the host implementation of sum will be invoked.
- If the same function sum is called from within an OpenACC parallel region, the device code for sum1 will be used instead.

This mechanism allows for greater flexibility in mapping host-visible API functions to optimized or device-specific versions, facilitating performance tuning or the use of hand-optimized GPU code. Additionally, the bind clause can reference a CUDA device function, enabling integration with low-level CUDA kernels or library routines when compiling OpenACC code for NVIDIA GPUs.

Related GCC Files and Functions:

- gcc/omp-general.cc
  - oacc_build_routine_dims: Generates attributes for the parallelism level.
  - oacc_verify_routine_clauses: Verifies and normalizes routine clauses.
- gcc/c/c-parser.cc: c_finish_oacc_routine calls oacc_verify_routine_clauses.
- gcc/cp/parser.cc: cp_finalize_oacc_routine calls oacc_verify_routine_clauses.
- gcc/fortran/f95-lang.cc and trans-decl.cc: Handle OpenACC routine clauses via attribute logic.

To implement support for the bind clause in GCC:

- Clause Parsing Integration: Extend the parsers to recognize and store the bind clause during front-end processing.
- Function Binding Mechanism: Add logic to map the host function name to the target device function specified in the bind clause.
- Host-Device Code Generation: Study how the host version of the function is generated and invoked. GCC uses two compilers during offloading, one for the host and one for the accelerator; after parsing, each expands the functions to separate addresses.
- Linking Investigation: Investigate whether linking affects this mapping. Specifically, study the generation and role of:
  - .gnu.lto_.offload_table: IR for declarations from the host compiler.
  - .gnu.offload_lto_.offload_table: IR for declarations from the accelerator compiler.

*5. OpenACC **device_type** Clause*

The device_type clause in OpenACC enables developers to fine-tune execution behavior per target architecture (e.g., NVIDIA or AMD GPUs) within the same directive. The allowed clauses following device_type differ by directive: for compute constructs (parallel, kernels), valid clauses include async, wait, num_gangs, num_workers, and vector_length; for loop constructs, valid clauses include gang, worker, vector, seq, independent, auto, and tile [1]. There should also be checks for the legality of these inputs. If clauses behave similarly across directives, this would simplify the implementation, but the specification does not confirm this uniformly, so a manual analysis is required to determine consistent clause behavior across the different constructs. Our goal is to design an implementation strategy for device_type that generalizes across constructs and integrates with GCC's modular clause handling system.
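To illustrate the intended semantics (untested, with arbitrary values; the same pattern appears in Thomas's reply quoted below), clauses that follow a device_type clause apply only when compiling and executing for that device type:

/* Per-device tuning on a compute construct: the num_gangs and
   vector_length values are arbitrary and only illustrate the semantics.  */
#pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n]) \
            device_type(nvidia) num_gangs(256) vector_length(128) \
            device_type(radeon) num_gangs(120) vector_length(64)
for (int i = 0; i < n; i++)
  c[i] = a[i] + b[i];

Here, Nvidia GPU offloading would use num_gangs(256) and vector_length(128), AMD GPU offloading would use num_gangs(120) and vector_length(64), and any other device type would fall back to the implementation defaults.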
GCC's current implementation of OpenACC clauses is highly modular—many constructs share clause parsing and offloading logic. By tracing the ChangeLog of the gomp-4_0-branch and analyzing existing clauses like num_gangs, we can work out which macros, clause parsing logic, and backend interfaces need to be added so that the new clause can set such values per device type. The device_type clause would be parsed in the front end for C/C++ and Fortran and lowered through AST, GENERIC, and GIMPLE, with the values for each device type carried along until the back end determines the target. At that point, clauses for other device types can be discarded during expansion. After this filtering, we simply follow the existing logic for setting clause values, as seen in other parts of the OpenACC implementation.

The default values for device-related clauses are not hardcoded. They can be explicitly set by users in directives, overridden by environment variables such as GOMP_OPENACC_DIM, or computed dynamically at runtime [11]. For example, in plugin-nvptx.c, CUDA driver APIs (like cuDeviceGetAttribute) are used to query hardware capabilities—such as the number of registers per multiprocessor, threads per block, and shared memory. These values are then used to compute optimal defaults for the execution configuration. By tracing patches that fetch and set these values—such as how GOMP_OPENACC_DIM is parsed and applied—we gain insight into how values specified in directives are preserved and used throughout the compilation process.

References

[1] OpenACC Specification: https://www.openacc.org/specification
[2] OpenACC Host Compiler Compilation Process: https://gcc.gnu.org/wiki/Offloading#Compilation_process
[3] Improving OpenACC kernels support in GCC: https://gcc.gnu.org/wiki/cauldron2017?action=AttachFile&do=get&target=OpenACC+kernels.pdf
[4] Issue with acc_memcpy_device: https://forums.developer.nvidia.com/t/issue-with-acc-memcpy-device/135977
[5] NVIDIA Kepler GK110/GK210 Architecture Whitepaper: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-product-literature/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf
[6] Data Prefetch Support: https://gcc.gnu.org/projects/prefetch.html
[7] Options That Control Optimization: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
[8] "AMD Instinct MI200" Instruction Set Architecture: https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/instinct-mi200-cdna2-instruction-set-architecture.pdf
[9] NVIDIA PTX ISA: https://docs.nvidia.com/cuda/pdf/ptx_isa_8.5.pdf
[10] [openacc] Document GOMP_OPENACC_DIM: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85129
[11] [gomp4] Adjust num_gangs and add a diagnostic for unsupported num_workers: https://gcc.gnu.org/legacy-ml/gcc-patches/2017-02/msg00834.html
[12] OpenACC routine bind: https://forums.developer.nvidia.com/t/openacc-routine-bind/133968

Best regards,
Chenlu

On Thu, Apr 3, 2025 at 3:17 AM Thomas Schwinge <tschwi...@baylibre.com> wrote:
> Hi Chenlu! > > On 2025-03-27T22:05:02+1100, Zhang lv via Gcc <gcc@gcc.gnu.org> wrote: > > Hi here, > > Welcome to GCC! > > > I found digging into OpenACC meaningful. It's a late start for a GSoC > > proposal > > ..., but not yet too late! :-) > > > and any suggestions from the community are appreciated! Feel free > > to comment on any part of it. To save time for readers, I've outlined my > > key understandings here—expansion and polishing are still needed.
> > > > I'm not sure whether my understanding of each task is correct, especially > > for the cache directive and device_type clause. Here is my current > > understanding: > > > > *3. Make the OpenACC **cache** Directive Actually Do Something* > > > > Currently, the cache directive in OpenACC is parsed at the front end, but > > not used for any optimization purposes, such as data prefetching or > moving > > data to low-latency memory (e.g., L1/L2/L3 cache or GPU cache [5]). > > > > *TODO:* My current understanding is that after the OACC_CACHE directive > is > > lowered to GIMPLE via gimplify_oacc_cache, a new tree-SSA optimization > pass > > could be added. This pass might be similar to the existing aprefetch > pass, > > or OpenACC prefetch logic could be integrated into it. > > ..., or even just synthesizing specific (GIMPLE?) constructs, so that the > existing pass can do its work -- in case that the pass' existing support > maps to (at least specific instances of) the functionality provided by > the respective offloading back ends. That'll need more research; I don't > know yet, either. > > > The goal may be emitting prefetch instructions by inserting suitable > > built-in functions and relying on the backend to map them to runtime API > > calls or device-specific instructions through RTL templates. > > > > However, several questions remain: > > > > - Since OpenACC supports both accelerators (e.g., GPUs) and multicore > > CPUs, should we handle both cases? > > - For CPUs, we can refer to each vendor's ISA (e.g., x86_64, Arm) > to > > decide which prefetch instructions to generate. > > In general that's right, but note that currently GCC does not yet > implement OpenACC support for multicore CPU, but only for AMD and Nvidia > GPUs, or single-threaded host-fallback execution (with 'if(false)', or if > no device is available). But then, in addition to the code offloading > devices, data prefetching etc. conceptually also applies to the host code > generation. (..., but that need not be the focus of the work, at this > time.) > > > - For GPUs, are we expected to use prefetch instructions from GPU > ISA > > or Should we manual use runtime API routines like > acc_memcpy_device to > > manage data? > > You're right that there may also be cases where we may call specific > functions of the respective GPU driver/library, to mark up memory regions > in specific ways, etc., but for OpenACC 'cache', my first idea would've > been to use specific individual GPU instructions, or instruction > variants/flags for existing instructions, as applicable. > > > - Additional considerations include choosing a suitable prefetch > > distance, which may differ by device type or architecture. > > Right. > > Understand, that at a specific point in the compilation pipeline, the > offloading code steam gets split off of the host code stream. Until > then, we'll have to keep a generic (GIMPLE) representation, and > afterwards, we know if we're host vs. AMD GPU vs. Nvidia GPU, and which > specific architecture we're compiling for, and can then map to specific > IR representations, and finally specific GPU instructions etc. > > > *5. OpenACC **device_type** Clause* > > > > *TODO:* Is the device_type clause designed to allow users to manually > > specify the target platform in source code, rather than via compiler > > options like -foffload=amdgcn-amdhsa="-march=gfx900" > > > > , or compiler building options like--target=nvptx-none? > > Yet different; see the OpenACC specification. 
High-level: if you have a > toolchain targeting several offload devices, it allows for tuning OpenACC > processing *differently* per 'device_type'. For example (untested, and > not very clever like this): > > #pragma acc parallel loop device_type(nvidia) num_gangs(100) > device_type(radeon) num_gangs(50) > > Nvidia GPU offloading compilation/execution will use 'num_gangs(100)', > AMD GPU offloading compilation/execution will use 'num_gangs(50)'. > > > My understanding for other task is as follows: > > > > > > *1. OpenACC **acc_memcpy_device** Runtime API Routine* > > > > The acc_memcpy_device routine is currently missing in GCC's OpenACC > runtime > > implementation. According to the specification, this routine copies a > > specified number of bytes from one device address (data_dev_src) to > another > > device address (data_dev_dest). Both addresses must reside in the current > > device’s memory. There is also an asynchronous variant that performs the > > data transfer on a specified async queue (async_arg). The routine must > > handle error cases such as null pointers and invalid async argument > values. > > This function shares a similar implementation pattern with > > acc_memcpy_to_device and acc_memcpy_from_device, which transfer data > > between host and device memory. > > > > Implementation will mainly involve modifying the following files: > > > > - libgomp/libgomp.map > > - libgomp/openacc.h > > - libgomp/openacc_lib.h > > - libgomp/openacc.f90 > > - libgomp/oacc-mem.c > > > > The existing functions such as memcpy_tofrom_device, gomp_copy_dev2host, > > and gomp_device_copy were primarily designed for acc_memcpy_to_device and > > acc_memcpy_from_device, which handle host-device transfers. > > That's right, that's the generic code -- but what additional > functions/files do the existing routines use, to actually implement the > functionality at the level of each specific device? > > > For > > acc_memcpy_device, which handles device-to-device transfers, we should > > design a similar logic. Further investigation is needed to structure and > > implement this functionality effectively. > > Right. > > > *2. Support for **init**, **shutdown**, and **set** Directives* > > > > These directives are currently unsupported at the front-end level in GCC, > > even though their corresponding runtime APIs—acc_init, acc_shutdown, > > acc_set_device_type, and their async queue variants—are implemented. The > > goal here is to add parsing support in the front end to map these > > directives to the appropriate built-in functions. In GCC, front ends map > > OpenACC directives to BUILT_IN_GOACC_* entries defined in > > gcc/omp-builtins.def, and the back end expands these into runtime API > > calls. This task involves analyzing and extending front-end source files, > > taking inspiration from the implementation of the wait directive. > Relevant > > files include: > > > > > > - gcc/c-family/c-omp.cc > > - gcc/c/c-parser.cc > > - gcc/cp/parser.cc > > - gcc/fortran/trans-openmp.cc > > - gcc/omp-builtins.def > > - gcc/omp-oacc-kernels-decompose.cc > > Right. > > > *4. OpenACC **bind** Clause Support* > > > > The bind clause appears in the routine directive > > Right. > > > and applies at the > > sequential level of parallelism. > > Please look into that, again. > > > The following restrictions must be > > enforced: > > > > - A routine may not bind to a name that already has a visible bind > > clause. 
> > - If a procedure has a bind clause on both its declaration and > > definition, they must bind to the same name. > > - When compiling for multicore host CPUs, any bind clause should be > > ignored. > > - A bind clause must not appear after a device_type(host) clause. > > > > These cases should be carefully validated during semantic analysis. We > can > > also use internal control variables (ICV) like default-device-var to > inform > > bind behavior. > > That'll need some clarification. What does the 'bind' clause actually > do, how could this be implemented in the compiler? > > > And my understanding about the background is as follows: > > > > > > Introduction > > > > OpenACC is a directive-based parallel programming model designed for > > heterogeneous HPC hardware. However, GCC currently only partially > supports > > the features specified in OpenACC 2.6: some directives are not parsed at > > all, some are parsed at the front end but are not lowered to generate the > > appropriate runtime API calls, and the runtime library implementation in > > GCC is also incomplete. This project aims to address these gaps by > > proposing feasible solutions to enhance GCC’s OpenACC support to more > > closely align with the official specification. > > Right, and we may of course also look into other OpenACC features that > appeared with later releases of the OpenACC specifications; often (but > not always) the features are orthogonal, meaning that later versions' > features may be implemented even if not (all of) the former ones have > been. > > > Grüße > Thomas > > > > Background > > > > OpenACC is a parallel programming model for heterogeneous HPC hardware, > > abstracted into two parts: the host and the attached parallel > accelerator, > > such as a GPU. It provides compiler directives (e.g., in C/C++: #pragma > acc > > directive-name [clause-list]) that allow users to specify > compute-intensive > > regions of a program to be offloaded to an accelerator or executed on > > multiple host cores under the control of a host thread. The* execution > > model* is host-directed: the host thread manages memory allocation on the > > accelerator, initiates data transfers, sends code and arguments to the > > device, queues tasks, waits for completion, retrieves results, and > > deallocates memory. A key aspect of OpenACC is its *memory model*: > > accelerator memory is often separate from host memory, requiring explicit > > data transfers handled by the OpenACC runtime through underlying system > > calls such as direct memory access (DMA) transfers. Nowadays, most > > accelerators include caches, and OpenACC requires the compiler to manage > > these caches [1]. > > > > GCC parses code containing OpenACC directives written in C/C++ or Fortran > > and uses the OpenMP runtime API routines from the libgomp > library—developed > > by GCC—to implement the functionality of each directive. At runtime, > libgomp > > can look up and launch an offload function when given a target function > > address [3]. These target functions are linked to libgomp plugins, which > > are loaded from the standard dynamic linker path. For example, the plugin > > for Intel MIC devices uses liboffloadmic.so, while the plugin for NVIDIA > > PTX devices uses libcuda.so [2]. These loaded plugins rely on > third-party, > > target-specific libraries to perform low-level interactions with > > accelerator devices. 
In short, libgomp is designed to be independent of > > specific accelerator architectures—it exposes a generic interface and > > delegates all target-dependent functionality to plugins. These plugins > are > > developed collaboratively by the GNU community and hardware vendors. > > > > > > > > References > > > > [1] OpenACC Specification: https://www.openacc.org/specification > > > > [2] OpenAcc Host Compiler Compilation Process: > > https://gcc.gnu.org/wiki/Offloading#Compilation_process > > > > [3] Improving OpenACC kernels support in GCC: > > > https://gcc.gnu.org/wiki/cauldron2017?action=AttachFile&do=get&target=OpenACC+kernels.pdf > > > > [4] Issue with acc_memcpy_device > > > https://forums.developer.nvidia.com/t/issue-with-acc-memcpy-device/135977 > > > > [5] NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf: > > > https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-product-literature/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf > > > > > > Best regards, > > > > Chenlu >