The attached patch adjusts the existing goacc validate_dims target hook and introduces a new goacc adjust_parallelism target hook. Now that vector length is no longer hard-coded to 32, there are four different ways to set it:
1) compiler default 2) explicitly via the vector_length clause 3) compile time using -fopenacc-dim or the GOMP_OPENACC_DIM environment variable 4) fallback to vector_length = 32 due to insufficient parallelism The compiler default is activated in the absence of 2) and 3). It is controlled by the macro PTX_VECTOR_LENGTH in nvptx.c. While working on this patch set, I had it set to 128 to get more test coverage. But in order to maintain backwards compatibility with acc routines (which is still a work in progress), I've kept the default vector length to 32. Besides, large vector reductions are expected to run slower until the parallel reduction finalizer is ready. The new default_dims arguments to validate_dims represents is necessary to accommodate option 3) from above. validate_dims is called after oaccdevlow has assigned parallelism to each acc loop. Prior to this patch, oaccdevlow automatically assigned parallelism to acc loops using oacc_loop_fixed_partitions and oacc_loop_auto_partitions. Both of those functions were processor-agnostic. In the case of nvptx, due to the current limitations in this patch set, the nvptx BE needs to fallback to using a vector_length of 32 whenever a vector loop is nested inside a worker loop. By supplying the parallelism mask for both the current loop and the outer loops, the goacc adjust_parallelism hook allows the back ends to fine tune any parallelism as necessary. Inside the nvptx BE, nvptx_goacc_adjust_parallelism uses a new "nvptx vl warp" function attribute to denote that the offloaded function must fallback to using a vector length of 32. Later, nvptx_goacc_validate_dims uses the attribute to adjust vector_length accordingly. Going forward, in addition to adding a new parallel reduction finalizer, the nvptx BE would benefit from merging synchronization and reduction code for combined worker-reduction loops, e.g. #pragma acc loop worker vector At present, GCC partitions acc loops with internal function markers for each level of parallelism associated with the loop. If a loop has both worker and vector level parallelism, it will have a dummy outer worker loop, and dummy inner vector loop. On CUDA hardware, there's no strong difference between workers and vectors as CUDA blocks are a loose collection of warps. Therefore, it would make more sense to merge the two loops together into a special WV loop. That would at least require some changes in the BE in addition to oacc_loop_{auto,fixed}_partitions. There were some problems in the past where CUDA hardware would lock up because the synchronization requirements for those two levels of parallelism. Merging them ought to simplify the synchronization code and enable the PTX JIT to generate better code. Overall, the changes in this patch are mild. I'll apply it to openacc-gcc-7-branch after Tom approves the reduction patch. Cesar
2018-03-02 Cesar Philippidis <ce...@codesourcery.com> gcc/ * config/nvptx/nvptx.c (NVPTX_GOACC_VL_WARP): Define. (nvptx_goacc_needs_vl_warp): New function. (nvptx_goacc_validate_dims): Add new default_dims argument and take larger vector lengths into account. (nvptx_adjust_parallelism): New function. (TARGET_GOACC_ADJUST_PARALLELISM): Define. * doc/tm.texi: Regenerate. * doc/tm.texi.in: Add placeholder for TARGET_GOACC_ADJUST_PARALLELISM. * omp-offload.c (oacc_parse_default_dims): Update usage of the targetm.goacc_valdate_dims hook. (oacc_validate_dims): Add default_dims argument. (oacc_loop_fixed_partitions): Use the adjust_parallelism hook to modify this_mask. (oacc_loop_auto_partitions): Use the adjust_parallelism hook to modify this_mask and loop->mask. (execute_oacc_device_lower): Update call to oacc_validate_dims. (default_goacc_adjust_parallelism): New function. * target.def (validate_dims): Add new default_dims argument. (adjust_parallelism): New hook. * targhooks.h (default_goacc_validate_dims): Add new argument. (default_goacc_adjust_parallelism): Declare. >From 1ee16b267dfbb0a148e8ec3b83ca463c21cbac1d Mon Sep 17 00:00:00 2001 From: Cesar Philippidis <ce...@codesourcery.com> Date: Fri, 2 Mar 2018 10:08:23 -0800 Subject: [PATCH] New target hooks --- gcc/config/nvptx/nvptx.c | 139 +++++++++++++++++++++++++++++++++++++++++++++-- gcc/doc/tm.texi | 15 +++-- gcc/doc/tm.texi.in | 2 + gcc/omp-offload.c | 35 ++++++++++-- gcc/target.def | 17 ++++-- gcc/targhooks.h | 3 +- 6 files changed, 190 insertions(+), 21 deletions(-) diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c index 5642941c6a3..507c8671704 100644 --- a/gcc/config/nvptx/nvptx.c +++ b/gcc/config/nvptx/nvptx.c @@ -5205,14 +5205,36 @@ nvptx_simt_vf () return PTX_WARP_SIZE; } +#define NVPTX_GOACC_VL_WARP "nvptx vl warp" + +/* Return true of the offloaded function needs a vector_length of + PTX_WARP_SIZE. */ + +static bool +nvptx_goacc_needs_vl_warp () +{ + tree attr = lookup_attribute (NVPTX_GOACC_VL_WARP, + DECL_ATTRIBUTES (current_function_decl)); + return attr == NULL_TREE; +} + /* Validate compute dimensions of an OpenACC offload or routine, fill in non-unity defaults. FN_LEVEL indicates the level at which a routine might spawn a loop. It is negative for non-routines. If DECL is null, we are validating the default dimensions. */ static bool -nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level) +nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level, + int default_dims[]) { + int default_vector_length = PTX_VECTOR_LENGTH; + + /* For capability reasons, fallback to vl = 32 for runtime values. */ + if (dims[GOMP_DIM_VECTOR] == 0) + default_vector_length = PTX_WARP_SIZE; + else if (default_dims) + default_vector_length = default_dims[GOMP_DIM_VECTOR]; + /* Detect if a function is unsuitable for offloading. */ if (!flag_offload_force && decl) { @@ -5237,18 +5259,20 @@ nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level) bool changed = false; - /* The vector size must be 32, unless this is a SEQ routine. */ + /* The vector size must be a positive multiple of the warp size, + unless this is a SEQ routine. */ if (fn_level <= GOMP_DIM_VECTOR && fn_level >= -1 && dims[GOMP_DIM_VECTOR] >= 0 - && dims[GOMP_DIM_VECTOR] != PTX_VECTOR_LENGTH) + && (dims[GOMP_DIM_VECTOR] % 32 != 0 + || dims[GOMP_DIM_VECTOR] == 0)) { if (fn_level < 0 && dims[GOMP_DIM_VECTOR] >= 0) warning_at (decl ? DECL_SOURCE_LOCATION (decl) : UNKNOWN_LOCATION, 0, dims[GOMP_DIM_VECTOR] ? G_("using vector_length (%d), ignoring %d") : G_("using vector_length (%d), ignoring runtime setting"), - PTX_VECTOR_LENGTH, dims[GOMP_DIM_VECTOR]); - dims[GOMP_DIM_VECTOR] = PTX_VECTOR_LENGTH; + default_vector_length, dims[GOMP_DIM_VECTOR]); + dims[GOMP_DIM_VECTOR] = default_vector_length; changed = true; } @@ -5262,16 +5286,77 @@ nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level) changed = true; } + /* Ensure that num_worker * vector_length < cta size. */ + if (dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR] > PTX_CTA_SIZE) + { + warning_at (decl ? DECL_SOURCE_LOCATION (decl) : UNKNOWN_LOCATION, 0, + G_("using vector_length (%d), ignoring %d"), + default_vector_length, dims[GOMP_DIM_VECTOR]); + dims[GOMP_DIM_VECTOR] = PTX_WARP_SIZE; + changed = true; + } + + /* vector_length must not exceed PTX_CTA_SIZE. */ + if (dims[GOMP_DIM_VECTOR] >= PTX_CTA_SIZE) + { + int new_vector = PTX_CTA_SIZE; + if (default_dims) + new_vector = default_vector_length; + warning_at (decl ? DECL_SOURCE_LOCATION (decl) : UNKNOWN_LOCATION, 0, + G_("using vector_length (%d), ignoring %d"), + new_vector, dims[GOMP_DIM_VECTOR]); + dims[GOMP_DIM_VECTOR] = new_vector; + changed = true; + } + + /* Set vector_length to default_vector_length if there are a sufficient + number of free threads in the CTA. */ + if (dims[GOMP_DIM_WORKER] > 0 && dims[GOMP_DIM_VECTOR] <= 0) + { + if (dims[GOMP_DIM_WORKER] * default_vector_length <= PTX_CTA_SIZE) + dims[GOMP_DIM_VECTOR] = default_vector_length; + else if (dims[GOMP_DIM_WORKER] * PTX_WARP_SIZE <= PTX_CTA_SIZE) + dims[GOMP_DIM_VECTOR] = PTX_WARP_SIZE; + else + error_at (decl ? DECL_SOURCE_LOCATION (decl) : UNKNOWN_LOCATION, + "vector_length must be at least 32"); + changed = true; + } + + /* Specify a default vector_length. */ + if (dims[GOMP_DIM_VECTOR] < 0) + { + dims[GOMP_DIM_VECTOR] = default_vector_length; + changed = true; + } + + if (nvptx_goacc_needs_vl_warp () && dims[GOMP_DIM_VECTOR] != PTX_WARP_SIZE) + { + dims[GOMP_DIM_VECTOR] = PTX_WARP_SIZE; + changed = true; + } + if (!decl) { - dims[GOMP_DIM_VECTOR] = PTX_VECTOR_LENGTH; + bool new_vector = false; + if (dims[GOMP_DIM_VECTOR] <= 1) + { + dims[GOMP_DIM_VECTOR] = default_vector_length; + new_vector = true; + } if (dims[GOMP_DIM_WORKER] < 0) dims[GOMP_DIM_WORKER] = PTX_DEFAULT_RUNTIME_DIM; if (dims[GOMP_DIM_GANG] < 0) dims[GOMP_DIM_GANG] = PTX_DEFAULT_RUNTIME_DIM; + if (new_vector + && dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR] > PTX_CTA_SIZE) + dims[GOMP_DIM_VECTOR] = PTX_WARP_SIZE; changed = true; } + gcc_assert (dims[GOMP_DIM_VECTOR] != 0); + gcc_assert (dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR] <= PTX_CTA_SIZE); + return changed; } @@ -5291,6 +5376,45 @@ nvptx_dim_limit (int axis) return 0; } +/* Adjust the parallelism available to a loop given vector_length + associated with the offloaded function. */ + +static unsigned +nvptx_adjust_parallelism (unsigned inner_mask, unsigned outer_mask) +{ + if (nvptx_goacc_needs_vl_warp ()) + return inner_mask; + + bool wv = (inner_mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)) + && (inner_mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR)); + offload_attrs oa; + + populate_offload_attrs (&oa); + + if (oa.vector_length == PTX_WARP_SIZE) + return inner_mask; + + /* FIXME: This is overly conservative; worker and vector loop will + eventually be combined. */ + if (wv) + return inner_mask & ~GOMP_DIM_MASK (GOMP_DIM_WORKER); + + /* It's difficult to guarantee that warps in large vector_lengths + will remain convergent when a vector loop is nested inside a + worker loop. Therefore, fallback to setting vector_length to + PTX_WARP_SIZE. Hopefully this condition may be relaxed for + sm_70+ targets. */ + if ((inner_mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR)) + && (outer_mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))) + { + tree attr = tree_cons (get_identifier (NVPTX_GOACC_VL_WARP), NULL_TREE, + DECL_ATTRIBUTES (current_function_decl)); + DECL_ATTRIBUTES (current_function_decl) = attr; + } + + return inner_mask; +} + /* Determine whether fork & joins are needed. */ static bool @@ -6180,6 +6304,9 @@ nvptx_set_current_function (tree fndecl) #undef TARGET_GOACC_DIM_LIMIT #define TARGET_GOACC_DIM_LIMIT nvptx_dim_limit +#undef TARGET_GOACC_ADJUST_PARALLELISM +#define TARGET_GOACC_ADJUST_PARALLELISM nvptx_adjust_parallelism + #undef TARGET_GOACC_FORK_JOIN #define TARGET_GOACC_FORK_JOIN nvptx_goacc_fork_join diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi index 0fcb9c64bf4..3028e438ddd 100644 --- a/gcc/doc/tm.texi +++ b/gcc/doc/tm.texi @@ -5865,7 +5865,7 @@ to use it. Return number of threads in SIMT thread group on the target. @end deftypefn -@deftypefn {Target Hook} bool TARGET_GOACC_VALIDATE_DIMS (tree @var{decl}, int *@var{dims}, int @var{fn_level}) +@deftypefn {Target Hook} bool TARGET_GOACC_VALIDATE_DIMS (tree @var{decl}, int *@var{dims}, int @var{fn_level}, int *@var{default_dims}) This hook should check the launch dimensions provided for an OpenACC compute region, or routine. Defaulted values are represented as -1 and non-constant values as 0. The @var{fn_level} is negative for the @@ -5873,9 +5873,10 @@ function corresponding to the compute region. For a routine is is the outermost level at which partitioned execution may be spawned. The hook should verify non-default values. If DECL is NULL, global defaults are being validated and unspecified defaults should be filled in. -Diagnostics should be issued as appropriate. Return -true, if changes have been made. You must override this hook to -provide dimensions larger than 1. +Diagnostics should be issued as appropriate. The @var{default_dims} +contain the user-specified default dims. Return true, if changes have +been made. You must override this hook to provide dimensions larger +than 1. @end deftypefn @deftypefn {Target Hook} int TARGET_GOACC_DIM_LIMIT (int @var{axis}) @@ -5883,6 +5884,12 @@ This hook should return the maximum size of a particular dimension, or zero if unbounded. @end deftypefn +@deftypefn {Target Hook} unsigned TARGET_GOACC_ADJUST_PARALLELISM (unsigned @var{this_mask}, unsigned @var{outer_mask}) +This hook allows the accelerator compiler to remove any unused +parallelism exposed in the current loop @var{THIS_MASK}, and the +enclosing loop @var{OUTER_MASK}. It returns an adjusted mask. +@end deftypefn + @deftypefn {Target Hook} bool TARGET_GOACC_FORK_JOIN (gcall *@var{call}, const int *@var{dims}, bool @var{is_fork}) This hook can be used to convert IFN_GOACC_FORK and IFN_GOACC_JOIN function calls to target-specific gimple, or indicate whether they diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in index 4187da139a9..fc73ad13e0a 100644 --- a/gcc/doc/tm.texi.in +++ b/gcc/doc/tm.texi.in @@ -4298,6 +4298,8 @@ address; but often a machine-dependent strategy can generate better code. @hook TARGET_GOACC_DIM_LIMIT +@hook TARGET_GOACC_ADJUST_PARALLELISM + @hook TARGET_GOACC_FORK_JOIN @hook TARGET_GOACC_REDUCTION diff --git a/gcc/omp-offload.c b/gcc/omp-offload.c index ba3f4317f4e..f15ce6b8f8d 100644 --- a/gcc/omp-offload.c +++ b/gcc/omp-offload.c @@ -615,8 +615,8 @@ oacc_parse_default_dims (const char *dims) } /* Allow the backend to validate the dimensions. */ - targetm.goacc.validate_dims (NULL_TREE, oacc_default_dims, -1); - targetm.goacc.validate_dims (NULL_TREE, oacc_min_dims, -2); + targetm.goacc.validate_dims (NULL_TREE, oacc_default_dims, -1, NULL); + targetm.goacc.validate_dims (NULL_TREE, oacc_min_dims, -2, NULL); } /* Validate and update the dimensions for offloaded FN. ATTRS is the @@ -626,7 +626,8 @@ oacc_parse_default_dims (const char *dims) function. */ static void -oacc_validate_dims (tree fn, tree attrs, int *dims, int level, unsigned used) +oacc_validate_dims (tree fn, tree attrs, int *dims, int level, unsigned used, + int * ARG_UNUSED (default_dims)) { tree purpose[GOMP_DIM_MAX]; unsigned ix; @@ -675,7 +676,8 @@ oacc_validate_dims (tree fn, tree attrs, int *dims, int level, unsigned used) axes[ix], axes[ix]); } - bool changed = targetm.goacc.validate_dims (fn, dims, level); + bool changed = targetm.goacc.validate_dims (fn, dims, level, + oacc_default_dims); /* Default anything left to 1 or a partitioned default. */ for (ix = 0; ix != GOMP_DIM_MAX; ix++) @@ -1258,6 +1260,13 @@ oacc_loop_fixed_partitions (oacc_loop *loop, unsigned outer_mask) } } + /* FIXME: Ideally, we should be coalescing parallelism here if the + hardware supports it. E.g. Instead of partitioning a loop + across worker and vector axes, sometimes the hardware can + execute those loops together without resorting to placing + extra thread barriers. */ + this_mask = targetm.goacc.adjust_parallelism (this_mask, outer_mask); + mask_all |= this_mask; if (loop->flags & OLF_TILE) @@ -1349,6 +1358,7 @@ oacc_loop_auto_partitions (oacc_loop *loop, unsigned outer_mask, this_mask ^= loop->e_mask; } + this_mask = targetm.goacc.adjust_parallelism (this_mask, outer_mask); loop->mask |= this_mask; } @@ -1397,6 +1407,8 @@ oacc_loop_auto_partitions (oacc_loop *loop, unsigned outer_mask, } loop->mask |= this_mask; + loop->mask = targetm.goacc.adjust_parallelism (loop->mask, outer_mask); + if (!loop->mask && noisy) warning_at (loop->loc, 0, tiling @@ -1604,7 +1616,8 @@ execute_oacc_device_lower () } int dims[GOMP_DIM_MAX]; - oacc_validate_dims (current_function_decl, attrs, dims, fn_level, used_mask); + oacc_validate_dims (current_function_decl, attrs, dims, fn_level, used_mask, + NULL); if (dump_file) { @@ -1746,7 +1759,8 @@ execute_oacc_device_lower () bool default_goacc_validate_dims (tree ARG_UNUSED (decl), int *dims, - int ARG_UNUSED (fn_level)) + int ARG_UNUSED (fn_level), + int * ARG_UNUSED (default_dims)) { bool changed = false; @@ -1774,6 +1788,15 @@ default_goacc_dim_limit (int ARG_UNUSED (axis)) #endif } +/* Default adjustment of loop parallelism is not required. */ + +unsigned +default_goacc_adjust_parallelism (unsigned this_mask, + unsigned ARG_UNUSED (outer_mask)) +{ + return this_mask; +} + namespace { const pass_data pass_data_oacc_device_lower = diff --git a/gcc/target.def b/gcc/target.def index b302d3639da..aa7da2c1b2c 100644 --- a/gcc/target.def +++ b/gcc/target.def @@ -1683,10 +1683,11 @@ function corresponding to the compute region. For a routine is is the\n\ outermost level at which partitioned execution may be spawned. The hook\n\ should verify non-default values. If DECL is NULL, global defaults\n\ are being validated and unspecified defaults should be filled in.\n\ -Diagnostics should be issued as appropriate. Return\n\ -true, if changes have been made. You must override this hook to\n\ -provide dimensions larger than 1.", -bool, (tree decl, int *dims, int fn_level), +Diagnostics should be issued as appropriate. The @var{default_dims}\n\ +contain the user-specified default dims. Return true, if changes have\n\ +been made. You must override this hook to provide dimensions larger\n\ +than 1.", +bool, (tree decl, int *dims, int fn_level, int *default_dims), default_goacc_validate_dims) DEFHOOK @@ -1696,6 +1697,14 @@ or zero if unbounded.", int, (int axis), default_goacc_dim_limit) +DEFHOOK +(adjust_parallelism, +"This hook allows the accelerator compiler to remove any unused\n\ +parallelism exposed in the current loop @var{THIS_MASK}, and the\n\ +enclosing loop @var{OUTER_MASK}. It returns an adjusted mask.", +unsigned, (unsigned this_mask, unsigned outer_mask), +default_goacc_adjust_parallelism) + DEFHOOK (fork_join, "This hook can be used to convert IFN_GOACC_FORK and IFN_GOACC_JOIN\n\ diff --git a/gcc/targhooks.h b/gcc/targhooks.h index 18070df7839..b60c72a38f1 100644 --- a/gcc/targhooks.h +++ b/gcc/targhooks.h @@ -111,10 +111,11 @@ extern void default_finish_cost (void *, unsigned *, unsigned *, unsigned *); extern void default_destroy_cost_data (void *); /* OpenACC hooks. */ -extern bool default_goacc_validate_dims (tree, int [], int); +extern bool default_goacc_validate_dims (tree, int [], int, int []); extern int default_goacc_dim_limit (int); extern bool default_goacc_fork_join (gcall *, const int [], bool); extern void default_goacc_reduction (gcall *); +extern unsigned default_goacc_adjust_parallelism (unsigned, unsigned); /* These are here, and not in hooks.[ch], because not all users of hooks.h include tm.h, and thus we don't have CUMULATIVE_ARGS. */ -- 2.14.3