Hi Chung-Lin! On 2019-09-10T19:41:59+0800, Chung-Lin Tang <chunglin_t...@mentor.com> wrote: > this is a completely new implementation of an earlier optimization > that Cesar submitted: > https://gcc.gnu.org/ml/gcc-patches/2017-12/msg01202.html
Thanks for your re-work! > The objective is to transform the original single-record-pointer argument > form (OpenMP/pthreads originated) to multiple scalar parameters, that > the CUDA runtime will place directly in the .params space for GPU kernels: > > #pragma acc parallel copy(a, b) copyin(c) > { > a += b; > b -= c; > } > > compiles to GIMPLE as: > > __attribute__((oacc function (1, 1, 32), omp target entrypoint)) > main._omp_fn.0 (const struct .omp_data_t.8 & restrict .omp_data_i) > { > ... > _3 = .omp_data_i_2(D)->a; > _4 = *_3; > _5 = .omp_data_i_2(D)->b; > _6 = *_5; > ... > > this patch adds pass to transform into: > > __attribute__((oacc function (1, 1, 32), omp target entrypoint)) > main._omp_fn.0 (int * c, int * b, int * a) > { > ... > _3 = a; > _4 = *_3; > _5 = b; > _6 = *_5; > ... ACK. > Cesar's original implementation tried to do this in the middle-end, > which required lots of changes throughout the compiler, libgomp interface, > etc. and required a dependency on libffi for the CPU-host fallback child > function (since there is no longer a known, fixed single-pointer argument > interface to all child functions) Specifically, the major problem -- per my understanding -- is that Cesar's implementation does this in the early stages of the middle end ('pass_lower_omp'), before the target vs. offload target code paths get separated, and so the transformation was done for target ("host fallback") as well as all offload targets, without each of them having the option to opt in/out. As can be seen from the new highly localized code changes (nvptx code only), your re-work clearly fixes that aspect! :-) > This new implementation works by modifying the GIMPLE for child functions > directly at the very start (before, actually) of RTL expansion That's now near the other end of the pipeline. ;-) What's the motivation for putting it there, instead of early in the nvptx offloading compilation (around 'pass_oacc_device_lower' etc. time, where I would've assumed this transformation to be done)? Not asking you to change that now, but curious for the reason. > and thus > is placed in TARGET_EXPAND_TO_RTL_HOOK, as the core issue is we inherently > need something different generated between the host-fallback vs for the GPU. (Likewise, different per each offload target.) > The new nvptx_expand_to_rtl_hook modifies the function decl type and > arguments, and scans the gimple body to remove occurrences of .omp_data_i.* > Detection of OpenACC child functions is done through "omp target entrypoint" > and "oacc function" attributes. Because OpenMP target child functions > have a more elaborate wrapper generated for them, this pass only supports > OpenACC right now. At the Cauldron, the question indeed has been raised (Jakub, Tom) why not enabled for OpenMP, too. My answer was that this surely can be done, but the change as presented here already is an improvement over the current status ("stands on its own", as Jeff Law would call it), so I'm fine with you handling OpenACC first, and then OpenMP can follow later (at some as of yet indeterminite point in time, even). > libgomp has tested with this patch x86_64-linux (nvptx-none accelerator) > without regressions Can you present performance numbers, too? > (I'm currently undergoing more gcc tests as well). As these changes, being confined to nvptx code only, can't possibly have any effect on other target testing, I assume that's nvptx target testing you're talking about? (..., where also I'm not expecting any disturbance.) > Is this okay for trunk? I'm not the one to approve these code changes, but I do have a few comments/questions: > --- gcc/config/nvptx/nvptx.c (revision 275493) > +++ gcc/config/nvptx/nvptx.c (working copy) > +static void > +nvptx_expand_to_rtl_hook (void) > +{ > + /* For utilizing CUDA .param kernel arguments, we detect and modify > + the gimple of offloaded child functions, here before RTL expansion, > + starting with standard OMP form: > + foo._omp_fn.0 (const struct .omp_data_t.8 & restrict .omp_data_i) { > ... } > + > + and transform it into a style where the OMP data record fields are > + "exploded" into individual scalar arguments: > + foo._omp_fn.0 (int * a, int * b, int * c) { ... } > + > + Note that there are implicit assumptions of how OMP lowering (and/or > other > + intervening passes) behaves contained in this transformation code; > + if those passes change in their output, this code may possibly need > + updating. */ > + > + if (lookup_attribute ("omp target entrypoint", > + DECL_ATTRIBUTES (current_function_decl)) > + /* The rather indirect manner in which OpenMP target functions are > + launched makes this transformation only valid for OpenACC currently. > + TODO: e.g. write_omp_entry(), nvptx_declare_function_name(), etc. > + needs changes for this to work with OpenMP. */ > + && lookup_attribute ("oacc function", > + DECL_ATTRIBUTES (current_function_decl)) > + && VOID_TYPE_P (TREE_TYPE (DECL_RESULT (current_function_decl)))) Why the 'void' return conditional? (Or, should that rather be an 'gcc_checking_assert' at the top of the following block?) > + { > + tree omp_data_arg = DECL_ARGUMENTS (current_function_decl); > + tree argtype = TREE_TYPE (omp_data_arg); > + > + /* Ensure this function is of the form of a single reference argument > + to the OMP data record, or a single void* argument (when no values > + passed) */ > + if (! (DECL_CHAIN (omp_data_arg) == NULL_TREE > + && ((TREE_CODE (argtype) == REFERENCE_TYPE > + && TREE_CODE (TREE_TYPE (argtype)) == RECORD_TYPE) > + || (TREE_CODE (argtype) == POINTER_TYPE > + && TREE_TYPE (argtype) == void_type_node)))) > + return; Again, is that something we should 'gcc_checking_assert', so that we'll notice when something changes/breaks? Given your note above, "there are implicit assumptions [on] OMP lowering", I'd assume that this code here does quite some 'gcc_checking_assert'ions to make sure that we're within the expected bounds. > + /* Remove local decls which correspond to *.omp_data_i->FIELD entries, > by > + scanning and skipping those entries, creating a new local_decls list. > + We assume a very specific MEM_REF tree expression shape. */ > + tree decl; > + unsigned int i; > + vec<tree, va_gc> *new_local_decls = NULL; > + FOR_EACH_VEC_SAFE_ELT (cfun->local_decls, i, decl) > + { > + if (DECL_HAS_VALUE_EXPR_P (decl)) > + { > + tree t = DECL_VALUE_EXPR (decl); > + if (TREE_CODE (t) == MEM_REF > + && TREE_CODE (TREE_OPERAND (t, 0)) == COMPONENT_REF > + && TREE_CODE (TREE_OPERAND (TREE_OPERAND (t, 0), 0)) == > MEM_REF > + && (TREE_OPERAND (TREE_OPERAND (TREE_OPERAND (t, 0), 0), 0) > + == omp_data_arg)) > + continue; > + } > + vec_safe_push (new_local_decls, decl); > + } > + vec_free (cfun->local_decls); > + cfun->local_decls = new_local_decls; Is it worth doing that manually, or can/should some dead code elimination pass deal with that? > + /* Scan function body for assignments from .omp_data_i->FIELD, and > using > + the above created fld_to_args hash map, convert them to reads of > + function arguments. */ > + else if (TREE_CODE (val) == MEM_REF > + && TREE_CODE (TREE_OPERAND (val, 0)) == SSA_NAME > + && SSA_NAME_VAR (TREE_OPERAND (val, 0)) == omp_data_arg) > + { > + /* This case may happen in the final tree level optimization > + output, due to SLP: > + vect.XX = MEM <vector(1) unsigned long> [(void > *).omp_data_i_5(D) + 8B] > + > + Therefore here we need a more elaborate search of the field > + list to reverse map to which field the offset is referring > + to. */ Would this be simpler if the conversion would be done earlier? (And I mentioned above.) > + /* If we found the corresponding OMP data record field, replace the > + RHS with the new created PARM_DECL. */ > + if (new_val != NULL_TREE) > + { > + if (dump_file) > + { > + fprintf (dump_file, "For gimple stmt: "); > + print_gimple_stmt (dump_file, stmt, 0); > + fprintf (dump_file, "\tReplacing OMP recv ref %s with %s\n", > + print_generic_expr_to_str (val), > + print_generic_expr_to_str (new_val)); > + } > + /* Write in looked up ARG as new RHS value. */ > + *val_ptr = new_val; > + } If 'new_val == NULL_TREE' that simply means that we've been looking at something that doesn't need to be handled here, right? > + /* Delete SSA_NAMEs of .omp_data_i by setting them to NULL_TREE. */ > + tree name; > + FOR_EACH_SSA_NAME (i, name, cfun) > + if (SSA_NAME_VAR (name) == omp_data_arg) > + (*SSANAMES (cfun))[SSA_NAME_VERSION (name)] = NULL_TREE; Again, manual cleanup vs. automated? > --- libgomp/plugin/plugin-nvptx.c (revision 275493) > +++ libgomp/plugin/plugin-nvptx.c (working copy) > @@ -1438,78 +1374,7 @@ GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void > unsigned *dims, void *targ_mem_desc, > struct goacc_asyncqueue *aq) > { > [...] > - if (mapnum > 0) > - GOMP_OFFLOAD_openacc_async_queue_callback (aq, cuda_free_argmem, block); > + nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, aq->cuda_stream); > } Wasn't that the only user of 'GOMP_OFFLOAD_openacc_async_queue_callback'? Grüße Thomas