Hi Chenlu!

On 2025-03-27T22:05:02+1100, Zhang lv via Gcc <gcc@gcc.gnu.org> wrote:
> Hi here,
Welcome to GCC!

> I found digging into OpenACC meaningful. It's a late start for a GSoC
> proposal

..., but not yet too late!  :-)

> and any suggestions from the community are appreciated! Feel free
> to comment on any part of it. To save time for readers, I've outlined my
> key understandings here—expansion and polishing are still needed.
>
> I'm not sure whether my understanding of each task is correct, especially
> for the cache directive and device_type clause. Here is my current
> understanding:
>
> *3. Make the OpenACC **cache** Directive Actually Do Something*
>
> Currently, the cache directive in OpenACC is parsed at the front end, but
> not used for any optimization purposes, such as data prefetching or moving
> data to low-latency memory (e.g., L1/L2/L3 cache or GPU cache [5]).
>
> *TODO:* My current understanding is that after the OACC_CACHE directive is
> lowered to GIMPLE via gimplify_oacc_cache, a new tree-SSA optimization pass
> could be added. This pass might be similar to the existing aprefetch pass,
> or OpenACC prefetch logic could be integrated into it.

..., or even just synthesizing specific (GIMPLE?) constructs, so that the
existing pass can do its work -- in case the pass's existing support maps
to (at least specific instances of) the functionality provided by the
respective offloading back ends.  That'll need more research; I don't know
yet, either.

> The goal may be emitting prefetch instructions by inserting suitable
> built-in functions and relying on the backend to map them to runtime API
> calls or device-specific instructions through RTL templates.
>
> However, several questions remain:
>
> - Since OpenACC supports both accelerators (e.g., GPUs) and multicore
>   CPUs, should we handle both cases?
> - For CPUs, we can refer to each vendor's ISA (e.g., x86_64, Arm) to
>   decide which prefetch instructions to generate.

In general that's right, but note that currently GCC does not yet implement
OpenACC support for multicore CPUs, but only for AMD and Nvidia GPUs, or
single-threaded host-fallback execution (with 'if(false)', or if no device
is available).  But then, in addition to the code-offloading devices, data
prefetching etc. conceptually also applies to the host code generation.
(..., but that need not be the focus of the work, at this time.)

> - For GPUs, are we expected to use prefetch instructions from the GPU ISA,
>   or should we manually use runtime API routines like acc_memcpy_device to
>   manage data?

You're right that there may also be cases where we'd call specific
functions of the respective GPU driver/library, to mark up memory regions
in specific ways, etc., but for OpenACC 'cache', my first idea would've
been to use specific individual GPU instructions, or instruction
variants/flags for existing instructions, as applicable.

> - Additional considerations include choosing a suitable prefetch
>   distance, which may differ by device type or architecture.

Right.  Understand that, at a specific point in the compilation pipeline,
the offloading code stream gets split off from the host code stream.  Until
then, we'll have to keep a generic (GIMPLE) representation, and afterwards,
we know if we're host vs. AMD GPU vs. Nvidia GPU, and which specific
architecture we're compiling for, and can then map to specific IR
representations, and finally specific GPU instructions etc.
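To make that concrete, user code exercising the directive looks roughly
like the following (made-up stencil kernel, untested; per the
specification, 'cache' appears at the top of a loop body and names the
data to be kept in the fastest applicable memory):

    /* Made-up example, untested.  The 'cache' directive requests that the
       'a[i-1:3]' subarray be kept in fast memory (for example, GPU
       L1/shared memory) for each iteration; currently, GCC parses this but
       doesn't act on it.  */
    void
    smooth (int n, const float *restrict a, float *restrict b)
    {
    #pragma acc parallel loop copyin(a[0:n]) copyout(b[0:n])
      for (int i = 1; i < n - 1; i++)
        {
    #pragma acc cache(a[i-1:3])
          b[i] = (a[i-1] + a[i] + a[i+1]) / 3.0f;
        }
    }

A new pass (or the existing 'aprefetch' one) could then, as you describe,
turn such a cached reference into prefetch built-ins, or into whatever
construct the respective offloading back end can make use of.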
> *5. OpenACC **device_type** Clause*
>
> *TODO:* Is the device_type clause designed to allow users to manually
> specify the target platform in source code, rather than via compiler
> options like -foffload=amdgcn-amdhsa="-march=gfx900"
>
> , or compiler build options like --target=nvptx-none?

It's yet different; see the OpenACC specification.  High-level: if you have
a toolchain targeting several offload devices, it allows for tuning OpenACC
processing *differently* per 'device_type'.  For example (untested, and not
very clever as written):

    #pragma acc parallel loop device_type(nvidia) num_gangs(100) device_type(radeon) num_gangs(50)

Nvidia GPU offloading compilation/execution will use 'num_gangs(100)', AMD
GPU offloading compilation/execution will use 'num_gangs(50)'.

> My understanding of the other tasks is as follows:
>
> *1. OpenACC **acc_memcpy_device** Runtime API Routine*
>
> The acc_memcpy_device routine is currently missing in GCC's OpenACC runtime
> implementation. According to the specification, this routine copies a
> specified number of bytes from one device address (data_dev_src) to another
> device address (data_dev_dest). Both addresses must reside in the current
> device's memory. There is also an asynchronous variant that performs the
> data transfer on a specified async queue (async_arg). The routine must
> handle error cases such as null pointers and invalid async argument values.
> This function shares a similar implementation pattern with
> acc_memcpy_to_device and acc_memcpy_from_device, which transfer data
> between host and device memory.
>
> Implementation will mainly involve modifying the following files:
>
> - libgomp/libgomp.map
> - libgomp/openacc.h
> - libgomp/openacc_lib.h
> - libgomp/openacc.f90
> - libgomp/oacc-mem.c
>
> The existing functions such as memcpy_tofrom_device, gomp_copy_dev2host,
> and gomp_device_copy were primarily designed for acc_memcpy_to_device and
> acc_memcpy_from_device, which handle host-device transfers.

That's right, that's the generic code -- but what additional
functions/files do the existing routines use, to actually implement the
functionality at the level of each specific device?  (For the user-facing
prototypes, see the P.S. below.)

> For
> acc_memcpy_device, which handles device-to-device transfers, we should
> design similar logic. Further investigation is needed to structure and
> implement this functionality effectively.

Right.

> *2. Support for **init**, **shutdown**, and **set** Directives*
>
> These directives are currently unsupported at the front-end level in GCC,
> even though their corresponding runtime APIs—acc_init, acc_shutdown,
> acc_set_device_type, and their async queue variants—are implemented. The
> goal here is to add parsing support in the front end to map these
> directives to the appropriate built-in functions. In GCC, front ends map
> OpenACC directives to BUILT_IN_GOACC_* entries defined in
> gcc/omp-builtins.def, and the back end expands these into runtime API
> calls. This task involves analyzing and extending front-end source files,
> taking inspiration from the implementation of the wait directive. Relevant
> files include:
>
> - gcc/c-family/c-omp.cc
> - gcc/c/c-parser.cc
> - gcc/cp/parser.cc
> - gcc/fortran/trans-openmp.cc
> - gcc/omp-builtins.def
> - gcc/omp-oacc-kernels-decompose.cc

Right.  (See also the P.S. below for the user-level syntax.)

> *4. OpenACC **bind** Clause Support*
>
> The bind clause appears in the routine directive

Right.

> and applies at the
> sequential level of parallelism.

Please look into that, again.
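For reference, here's my (untested) reading of what the clause looks like
in user code; 'fast_exp' and 'fast_exp_dev' are made-up names:

    /* As I read the specification: when 'fast_exp' is called from within
       an OpenACC compute region, the call is to be emitted to
       'fast_exp_dev' instead (assumed here to have a device-callable
       definition).  */
    #pragma acc routine seq bind(fast_exp_dev)
    extern double fast_exp (double);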
> The following restrictions must be
> enforced:
>
> - A routine may not bind to a name that already has a visible bind
>   clause.
> - If a procedure has a bind clause on both its declaration and
>   definition, they must bind to the same name.
> - When compiling for multicore host CPUs, any bind clause should be
>   ignored.
> - A bind clause must not appear after a device_type(host) clause.
>
> These cases should be carefully validated during semantic analysis. We can
> also use internal control variables (ICVs) like default-device-var to
> inform bind behavior.

That'll need some clarification: what does the 'bind' clause actually do,
and how could it be implemented in the compiler?

> And my understanding about the background is as follows:
>
>
> Introduction
>
> OpenACC is a directive-based parallel programming model designed for
> heterogeneous HPC hardware. However, GCC currently only partially supports
> the features specified in OpenACC 2.6: some directives are not parsed at
> all, some are parsed at the front end but are not lowered to generate the
> appropriate runtime API calls, and the runtime library implementation in
> GCC is also incomplete. This project aims to address these gaps by
> proposing feasible solutions to enhance GCC's OpenACC support to more
> closely align with the official specification.

Right, and we may of course also look into other OpenACC features that
appeared with later releases of the OpenACC specification; often (but not
always) the features are orthogonal, meaning that later versions' features
may be implemented even if not (all of) the former ones have been.


Grüße
 Thomas


> Background
>
> OpenACC is a parallel programming model for heterogeneous HPC hardware,
> abstracted into two parts: the host and the attached parallel accelerator,
> such as a GPU. It provides compiler directives (e.g., in C/C++: #pragma acc
> directive-name [clause-list]) that allow users to specify compute-intensive
> regions of a program to be offloaded to an accelerator or executed on
> multiple host cores under the control of a host thread. The *execution
> model* is host-directed: the host thread manages memory allocation on the
> accelerator, initiates data transfers, sends code and arguments to the
> device, queues tasks, waits for completion, retrieves results, and
> deallocates memory. A key aspect of OpenACC is its *memory model*:
> accelerator memory is often separate from host memory, requiring explicit
> data transfers handled by the OpenACC runtime through underlying system
> calls such as direct memory access (DMA) transfers. Nowadays, most
> accelerators include caches, and OpenACC requires the compiler to manage
> these caches [1].
>
> GCC parses code containing OpenACC directives written in C/C++ or Fortran
> and uses runtime API routines from the libgomp library—developed as part
> of GCC—to implement the functionality of each directive. At runtime,
> libgomp can look up and launch an offload function when given a target
> function address [3]. These target functions are linked to libgomp
> plugins, which are loaded from the standard dynamic linker path. For
> example, the plugin for Intel MIC devices uses liboffloadmic.so, while the
> plugin for NVIDIA PTX devices uses libcuda.so [2]. These loaded plugins
> rely on third-party, target-specific libraries to perform low-level
> interactions with accelerator devices.
> In short, libgomp is designed to be independent of
> specific accelerator architectures—it exposes a generic interface and
> delegates all target-dependent functionality to plugins. These plugins are
> developed collaboratively by the GNU community and hardware vendors.
>
>
> References
>
> [1] OpenACC Specification:
> https://www.openacc.org/specification
>
> [2] OpenACC Host Compiler Compilation Process:
> https://gcc.gnu.org/wiki/Offloading#Compilation_process
>
> [3] Improving OpenACC kernels support in GCC:
> https://gcc.gnu.org/wiki/cauldron2017?action=AttachFile&do=get&target=OpenACC+kernels.pdf
>
> [4] Issue with acc_memcpy_device:
> https://forums.developer.nvidia.com/t/issue-with-acc-memcpy-device/135977
>
> [5] NVIDIA Kepler GK110/GK210 Architecture Whitepaper:
> https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-product-literature/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf
>
>
> Best regards,
>
> Chenlu
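P.S. Two quick sketches, to make items 1 and 2 above a bit more concrete;
both are untested, and just reflect my reading of the specification.

For 'acc_memcpy_device' (item 1), going by the wording you quote
(data_dev_dest, data_dev_src, bytes, async_arg), the user-facing prototypes
to be added to 'libgomp/openacc.h' would presumably be along these lines:

    /* Untested sketch; exact names and types to be checked against the
       specification and against the existing acc_memcpy_to_device /
       acc_memcpy_from_device declarations.  */
    void acc_memcpy_device (void *data_dev_dest, void *data_dev_src,
                            size_t bytes);
    void acc_memcpy_device_async (void *data_dev_dest, void *data_dev_src,
                                  size_t bytes, int async_arg);

..., plus the corresponding Fortran interfaces, 'libgomp.map' entries, and
the per-device implementation bits discussed above.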
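And for the 'init', 'shutdown', and 'set' directives (item 2), the
user-level syntax the front ends would need to accept is roughly the
following; the runtime routines indicated in the comments are only what I'd
expect the directives to map to via the BUILT_IN_GOACC_* path, not
verified:

    /* Untested sketch; 'nvidia' stands for any supported device type, and
       the function itself is made up.  */
    void
    select_and_release_device (void)
    {
      /* cf. acc_init (acc_device_nvidia).  */
    #pragma acc init device_type(nvidia)

      /* cf. acc_set_device_type / acc_set_device_num.  */
    #pragma acc set device_type(nvidia) device_num(0)

      /* ... offloaded work ... */

      /* cf. acc_shutdown (acc_device_nvidia).  */
    #pragma acc shutdown device_type(nvidia)
    }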