[og7] vector_length extension part 4: target hooks and automatic parallelism

Cesar Philippidis Fri, 02 Mar 2018 11:19:01 -0800

The attached patch adjusts the existing goacc validate_dims target hook
and introduces a new goacc adjust_parallelism target hook. Now that
vector length is no longer hard-coded to 32, there are four different
ways to set it:


  1) compiler default
  2) explicitly via the vector_length clause
  3) compile time using -fopenacc-dim or the GOMP_OPENACC_DIM
     environment variable
  4) fallback to vector_length = 32 due to insufficient parallelism

The compiler default is activated in the absence of 2) and 3). It is
controlled by the macro PTX_VECTOR_LENGTH in nvptx.c. While working on
this patch set, I had it set to 128 to get more test coverage. But in
order to maintain backwards compatibility with acc routines (which is
still a work in progress), I've kept the default vector length to 32.
Besides, large vector reductions are expected to run slower until the
parallel reduction finalizer is ready.

The new default_dims arguments to validate_dims represents is necessary
to accommodate option 3) from above. validate_dims is called after
oaccdevlow has assigned parallelism to each acc loop.

Prior to this patch, oaccdevlow automatically assigned parallelism to
acc loops using oacc_loop_fixed_partitions and
oacc_loop_auto_partitions. Both of those functions were
processor-agnostic. In the case of nvptx, due to the current limitations
in this patch set, the nvptx BE needs to fallback to using a
vector_length of 32 whenever a vector loop is nested inside a worker
loop. By supplying the parallelism mask for both the current loop and
the outer loops, the goacc adjust_parallelism hook allows the back ends
to fine tune any parallelism as necessary.

Inside the nvptx BE, nvptx_goacc_adjust_parallelism uses a new "nvptx vl
warp" function attribute to denote that the offloaded function must
fallback to using a vector length of 32. Later,
nvptx_goacc_validate_dims uses the attribute to adjust vector_length
accordingly.

Going forward, in addition to adding a new parallel reduction finalizer,
the nvptx BE would benefit from merging synchronization and reduction
code for combined worker-reduction loops, e.g.

  #pragma acc loop worker vector

At present, GCC partitions acc loops with internal function markers for
each level of parallelism associated with the loop. If a loop has both
worker and vector level parallelism, it will have a dummy outer worker
loop, and dummy inner vector loop. On CUDA hardware, there's no strong
difference between workers and vectors as CUDA blocks are a loose
collection of warps. Therefore, it would make more sense to merge the
two loops together into a special WV loop. That would at least require
some changes in the BE in addition to oacc_loop_{auto,fixed}_partitions.
There were some problems in the past where CUDA hardware would lock up
because the synchronization requirements for those two levels of
parallelism. Merging them ought to simplify the synchronization code and
enable the PTX JIT to generate better code.

Overall, the changes in this patch are mild. I'll apply it to
openacc-gcc-7-branch after Tom approves the reduction patch.

Cesar

2018-03-02  Cesar Philippidis  <ce...@codesourcery.com>

	gcc/
	* config/nvptx/nvptx.c (NVPTX_GOACC_VL_WARP): Define.
	(nvptx_goacc_needs_vl_warp): New function.
	(nvptx_goacc_validate_dims): Add new default_dims argument and take
	larger vector lengths into account.
	(nvptx_adjust_parallelism): New function.
	(TARGET_GOACC_ADJUST_PARALLELISM): Define.
	* doc/tm.texi: Regenerate.
	* doc/tm.texi.in: Add placeholder for TARGET_GOACC_ADJUST_PARALLELISM.
	* omp-offload.c (oacc_parse_default_dims): Update usage of the
	targetm.goacc_valdate_dims hook.
	(oacc_validate_dims): Add default_dims argument.
	(oacc_loop_fixed_partitions): Use the adjust_parallelism hook to
	modify this_mask.
	(oacc_loop_auto_partitions): Use the adjust_parallelism hook to
	modify this_mask and loop->mask.
	(execute_oacc_device_lower): Update call to oacc_validate_dims.
	(default_goacc_adjust_parallelism): New function.
	* target.def (validate_dims): Add new default_dims argument.
	(adjust_parallelism): New hook.
	* targhooks.h (default_goacc_validate_dims): Add new argument.
	(default_goacc_adjust_parallelism): Declare.

>From 1ee16b267dfbb0a148e8ec3b83ca463c21cbac1d Mon Sep 17 00:00:00 2001
From: Cesar Philippidis <ce...@codesourcery.com>
Date: Fri, 2 Mar 2018 10:08:23 -0800
Subject: [PATCH] New target hooks

---
 gcc/config/nvptx/nvptx.c | 139 +++++++++++++++++++++++++++++++++++++++++++++--
 gcc/doc/tm.texi          |  15 +++--
 gcc/doc/tm.texi.in       |   2 +
 gcc/omp-offload.c        |  35 ++++++++++--
 gcc/target.def           |  17 ++++--
 gcc/targhooks.h          |   3 +-
 6 files changed, 190 insertions(+), 21 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 5642941c6a3..507c8671704 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -5205,14 +5205,36 @@ nvptx_simt_vf ()
   return PTX_WARP_SIZE;
 }
 
+#define NVPTX_GOACC_VL_WARP "nvptx vl warp"
+
+/* Return true of the offloaded function needs a vector_length of
+   PTX_WARP_SIZE.  */
+
+static bool
+nvptx_goacc_needs_vl_warp ()
+{
+  tree attr = lookup_attribute (NVPTX_GOACC_VL_WARP,
+				DECL_ATTRIBUTES (current_function_decl));
+  return attr == NULL_TREE;
+}
+
 /* Validate compute dimensions of an OpenACC offload or routine, fill
    in non-unity defaults.  FN_LEVEL indicates the level at which a
    routine might spawn a loop.  It is negative for non-routines.  If
    DECL is null, we are validating the default dimensions.  */
 
 static bool
-nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
+nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level,
+			   int default_dims[])
 {
+  int default_vector_length = PTX_VECTOR_LENGTH;
+
+  /* For capability reasons, fallback to vl = 32 for runtime values.  */
+  if (dims[GOMP_DIM_VECTOR] == 0)
+    default_vector_length = PTX_WARP_SIZE;
+  else if (default_dims)
+      default_vector_length = default_dims[GOMP_DIM_VECTOR];
+
   /* Detect if a function is unsuitable for offloading.  */
   if (!flag_offload_force && decl)
     {
@@ -5237,18 +5259,20 @@ nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
 
   bool changed = false;
 
-  /* The vector size must be 32, unless this is a SEQ routine.  */
+  /* The vector size must be a positive multiple of the warp size,
+     unless this is a SEQ routine.  */
   if (fn_level <= GOMP_DIM_VECTOR && fn_level >= -1
       && dims[GOMP_DIM_VECTOR] >= 0
-      && dims[GOMP_DIM_VECTOR] != PTX_VECTOR_LENGTH)
+      && (dims[GOMP_DIM_VECTOR] % 32 != 0
+	  || dims[GOMP_DIM_VECTOR] == 0))
     {
       if (fn_level < 0 && dims[GOMP_DIM_VECTOR] >= 0)
 	warning_at (decl ? DECL_SOURCE_LOCATION (decl) : UNKNOWN_LOCATION, 0,
 		    dims[GOMP_DIM_VECTOR]
 		    ? G_("using vector_length (%d), ignoring %d")
 		    : G_("using vector_length (%d), ignoring runtime setting"),
-		    PTX_VECTOR_LENGTH, dims[GOMP_DIM_VECTOR]);
-      dims[GOMP_DIM_VECTOR] = PTX_VECTOR_LENGTH;
+		    default_vector_length, dims[GOMP_DIM_VECTOR]);
+      dims[GOMP_DIM_VECTOR] = default_vector_length;
       changed = true;
     }
 
@@ -5262,16 +5286,77 @@ nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
       changed = true;
     }
 
+  /* Ensure that num_worker * vector_length < cta size.  */
+  if (dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR] > PTX_CTA_SIZE)
+    {
+      warning_at (decl ? DECL_SOURCE_LOCATION (decl) : UNKNOWN_LOCATION, 0,
+		  G_("using vector_length (%d), ignoring %d"),
+		  default_vector_length, dims[GOMP_DIM_VECTOR]);
+      dims[GOMP_DIM_VECTOR] = PTX_WARP_SIZE;
+      changed = true;
+    }
+
+  /* vector_length must not exceed PTX_CTA_SIZE.  */
+  if (dims[GOMP_DIM_VECTOR] >= PTX_CTA_SIZE)
+    {
+      int new_vector = PTX_CTA_SIZE;
+      if (default_dims)
+	new_vector = default_vector_length;
+      warning_at (decl ? DECL_SOURCE_LOCATION (decl) : UNKNOWN_LOCATION, 0,
+		  G_("using vector_length (%d), ignoring %d"),
+		  new_vector, dims[GOMP_DIM_VECTOR]);
+      dims[GOMP_DIM_VECTOR] = new_vector;
+      changed = true;
+    }
+
+  /* Set vector_length to default_vector_length if there are a sufficient
+     number of free threads in the CTA.  */
+  if (dims[GOMP_DIM_WORKER] > 0 && dims[GOMP_DIM_VECTOR] <= 0)
+    {
+      if (dims[GOMP_DIM_WORKER] * default_vector_length <= PTX_CTA_SIZE)
+	dims[GOMP_DIM_VECTOR] = default_vector_length;
+      else if (dims[GOMP_DIM_WORKER] * PTX_WARP_SIZE <= PTX_CTA_SIZE)
+	dims[GOMP_DIM_VECTOR] = PTX_WARP_SIZE;
+      else
+	error_at (decl ? DECL_SOURCE_LOCATION (decl) : UNKNOWN_LOCATION,
+		  "vector_length must be at least 32");
+      changed = true;
+    }
+
+  /* Specify a default vector_length.  */
+  if (dims[GOMP_DIM_VECTOR] < 0)
+    {
+      dims[GOMP_DIM_VECTOR] = default_vector_length;
+      changed = true;
+    }
+
+  if (nvptx_goacc_needs_vl_warp () && dims[GOMP_DIM_VECTOR] != PTX_WARP_SIZE)
+    {
+      dims[GOMP_DIM_VECTOR] = PTX_WARP_SIZE;
+      changed = true;
+    }
+
   if (!decl)
     {
-      dims[GOMP_DIM_VECTOR] = PTX_VECTOR_LENGTH;
+      bool new_vector = false;
+      if (dims[GOMP_DIM_VECTOR] <= 1)
+	{
+	  dims[GOMP_DIM_VECTOR] = default_vector_length;
+	  new_vector = true;
+	}
       if (dims[GOMP_DIM_WORKER] < 0)
 	dims[GOMP_DIM_WORKER] = PTX_DEFAULT_RUNTIME_DIM;
       if (dims[GOMP_DIM_GANG] < 0)
 	dims[GOMP_DIM_GANG] = PTX_DEFAULT_RUNTIME_DIM;
+      if (new_vector
+	  && dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR] > PTX_CTA_SIZE)
+	dims[GOMP_DIM_VECTOR] = PTX_WARP_SIZE;
       changed = true;
     }
 
+  gcc_assert (dims[GOMP_DIM_VECTOR] != 0);
+  gcc_assert (dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR] <= PTX_CTA_SIZE);
+
   return changed;
 }
 
@@ -5291,6 +5376,45 @@ nvptx_dim_limit (int axis)
   return 0;
 }
 
+/* Adjust the parallelism available to a loop given vector_length
+   associated with the offloaded function.  */
+
+static unsigned
+nvptx_adjust_parallelism (unsigned inner_mask, unsigned outer_mask)
+{
+  if (nvptx_goacc_needs_vl_warp ())
+    return inner_mask;
+
+  bool wv = (inner_mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+    && (inner_mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR));
+  offload_attrs oa;
+
+  populate_offload_attrs (&oa);
+
+  if (oa.vector_length == PTX_WARP_SIZE)
+    return inner_mask;
+
+  /* FIXME: This is overly conservative; worker and vector loop will
+     eventually be combined.  */
+  if (wv)
+    return inner_mask & ~GOMP_DIM_MASK (GOMP_DIM_WORKER);
+
+  /* It's difficult to guarantee that warps in large vector_lengths
+     will remain convergent when a vector loop is nested inside a
+     worker loop.  Therefore, fallback to setting vector_length to
+     PTX_WARP_SIZE.  Hopefully this condition may be relaxed for
+     sm_70+ targets.  */
+  if ((inner_mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
+      && (outer_mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)))
+    {
+      tree attr = tree_cons (get_identifier (NVPTX_GOACC_VL_WARP), NULL_TREE,
+			      DECL_ATTRIBUTES (current_function_decl));
+      DECL_ATTRIBUTES (current_function_decl) = attr;
+    }
+
+  return inner_mask;
+}
+
 /* Determine whether fork & joins are needed.  */
 
 static bool
@@ -6180,6 +6304,9 @@ nvptx_set_current_function (tree fndecl)
 #undef TARGET_GOACC_DIM_LIMIT
 #define TARGET_GOACC_DIM_LIMIT nvptx_dim_limit
 
+#undef TARGET_GOACC_ADJUST_PARALLELISM
+#define TARGET_GOACC_ADJUST_PARALLELISM nvptx_adjust_parallelism
+
 #undef TARGET_GOACC_FORK_JOIN
 #define TARGET_GOACC_FORK_JOIN nvptx_goacc_fork_join
 
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 0fcb9c64bf4..3028e438ddd 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -5865,7 +5865,7 @@ to use it.
 Return number of threads in SIMT thread group on the target.
 @end deftypefn
 
-@deftypefn {Target Hook} bool TARGET_GOACC_VALIDATE_DIMS (tree @var{decl}, int *@var{dims}, int @var{fn_level})
+@deftypefn {Target Hook} bool TARGET_GOACC_VALIDATE_DIMS (tree @var{decl}, int *@var{dims}, int @var{fn_level}, int *@var{default_dims})
 This hook should check the launch dimensions provided for an OpenACC
 compute region, or routine.  Defaulted values are represented as -1
 and non-constant values as 0.  The @var{fn_level} is negative for the
@@ -5873,9 +5873,10 @@ function corresponding to the compute region.  For a routine is is the
 outermost level at which partitioned execution may be spawned.  The hook
 should verify non-default values.  If DECL is NULL, global defaults
 are being validated and unspecified defaults should be filled in.
-Diagnostics should be issued as appropriate.  Return
-true, if changes have been made.  You must override this hook to
-provide dimensions larger than 1.
+Diagnostics should be issued as appropriate.  The @var{default_dims}
+contain the user-specified default dims.  Return true, if changes have
+been made.  You must override this hook to provide dimensions larger
+than 1.
 @end deftypefn
 
 @deftypefn {Target Hook} int TARGET_GOACC_DIM_LIMIT (int @var{axis})
@@ -5883,6 +5884,12 @@ This hook should return the maximum size of a particular dimension,
 or zero if unbounded.
 @end deftypefn
 
+@deftypefn {Target Hook} unsigned TARGET_GOACC_ADJUST_PARALLELISM (unsigned @var{this_mask}, unsigned @var{outer_mask})
+This hook allows the accelerator compiler to remove any unused
+parallelism exposed in the current loop @var{THIS_MASK}, and the
+enclosing loop @var{OUTER_MASK}.  It returns an adjusted mask.
+@end deftypefn
+
 @deftypefn {Target Hook} bool TARGET_GOACC_FORK_JOIN (gcall *@var{call}, const int *@var{dims}, bool @var{is_fork})
 This hook can be used to convert IFN_GOACC_FORK and IFN_GOACC_JOIN
 function calls to target-specific gimple, or indicate whether they
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 4187da139a9..fc73ad13e0a 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4298,6 +4298,8 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_GOACC_DIM_LIMIT
 
+@hook TARGET_GOACC_ADJUST_PARALLELISM
+
 @hook TARGET_GOACC_FORK_JOIN
 
 @hook TARGET_GOACC_REDUCTION
diff --git a/gcc/omp-offload.c b/gcc/omp-offload.c
index ba3f4317f4e..f15ce6b8f8d 100644
--- a/gcc/omp-offload.c
+++ b/gcc/omp-offload.c
@@ -615,8 +615,8 @@ oacc_parse_default_dims (const char *dims)
     }
 
   /* Allow the backend to validate the dimensions.  */
-  targetm.goacc.validate_dims (NULL_TREE, oacc_default_dims, -1);
-  targetm.goacc.validate_dims (NULL_TREE, oacc_min_dims, -2);
+  targetm.goacc.validate_dims (NULL_TREE, oacc_default_dims, -1, NULL);
+  targetm.goacc.validate_dims (NULL_TREE, oacc_min_dims, -2, NULL);
 }
 
 /* Validate and update the dimensions for offloaded FN.  ATTRS is the
@@ -626,7 +626,8 @@ oacc_parse_default_dims (const char *dims)
    function.  */
 
 static void
-oacc_validate_dims (tree fn, tree attrs, int *dims, int level, unsigned used)
+oacc_validate_dims (tree fn, tree attrs, int *dims, int level, unsigned used,
+		    int * ARG_UNUSED (default_dims))
 {
   tree purpose[GOMP_DIM_MAX];
   unsigned ix;
@@ -675,7 +676,8 @@ oacc_validate_dims (tree fn, tree attrs, int *dims, int level, unsigned used)
 		      axes[ix], axes[ix]);
     }
 
-  bool changed = targetm.goacc.validate_dims (fn, dims, level);
+  bool changed = targetm.goacc.validate_dims (fn, dims, level,
+					      oacc_default_dims);
 
   /* Default anything left to 1 or a partitioned default.  */
   for (ix = 0; ix != GOMP_DIM_MAX; ix++)
@@ -1258,6 +1260,13 @@ oacc_loop_fixed_partitions (oacc_loop *loop, unsigned outer_mask)
 	}
     }
 
+  /* FIXME: Ideally, we should be coalescing parallelism here if the
+     hardware supports it.  E.g. Instead of partitioning a loop
+     across worker and vector axes, sometimes the hardware can
+     execute those loops together without resorting to placing
+     extra thread barriers.  */
+  this_mask = targetm.goacc.adjust_parallelism (this_mask, outer_mask);
+
   mask_all |= this_mask;
 
   if (loop->flags & OLF_TILE)
@@ -1349,6 +1358,7 @@ oacc_loop_auto_partitions (oacc_loop *loop, unsigned outer_mask,
 	  this_mask ^= loop->e_mask;
 	}
 
+      this_mask = targetm.goacc.adjust_parallelism (this_mask, outer_mask);
       loop->mask |= this_mask;
     }
 
@@ -1397,6 +1407,8 @@ oacc_loop_auto_partitions (oacc_loop *loop, unsigned outer_mask,
 	}
 
       loop->mask |= this_mask;
+      loop->mask = targetm.goacc.adjust_parallelism (loop->mask, outer_mask);
+
       if (!loop->mask && noisy)
 	warning_at (loop->loc, 0,
 		    tiling
@@ -1604,7 +1616,8 @@ execute_oacc_device_lower ()
     }
 
   int dims[GOMP_DIM_MAX];
-  oacc_validate_dims (current_function_decl, attrs, dims, fn_level, used_mask);
+  oacc_validate_dims (current_function_decl, attrs, dims, fn_level, used_mask,
+		      NULL);
 
   if (dump_file)
     {
@@ -1746,7 +1759,8 @@ execute_oacc_device_lower ()
 
 bool
 default_goacc_validate_dims (tree ARG_UNUSED (decl), int *dims,
-			     int ARG_UNUSED (fn_level))
+			     int ARG_UNUSED (fn_level),
+			     int * ARG_UNUSED (default_dims))
 {
   bool changed = false;
 
@@ -1774,6 +1788,15 @@ default_goacc_dim_limit (int ARG_UNUSED (axis))
 #endif
 }
 
+/* Default adjustment of loop parallelism is not required.  */
+
+unsigned
+default_goacc_adjust_parallelism (unsigned this_mask,
+				  unsigned ARG_UNUSED (outer_mask))
+{
+  return this_mask;
+}
+
 namespace {
 
 const pass_data pass_data_oacc_device_lower =
diff --git a/gcc/target.def b/gcc/target.def
index b302d3639da..aa7da2c1b2c 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1683,10 +1683,11 @@ function corresponding to the compute region.  For a routine is is the\n\
 outermost level at which partitioned execution may be spawned.  The hook\n\
 should verify non-default values.  If DECL is NULL, global defaults\n\
 are being validated and unspecified defaults should be filled in.\n\
-Diagnostics should be issued as appropriate.  Return\n\
-true, if changes have been made.  You must override this hook to\n\
-provide dimensions larger than 1.",
-bool, (tree decl, int *dims, int fn_level),
+Diagnostics should be issued as appropriate.  The @var{default_dims}\n\
+contain the user-specified default dims.  Return true, if changes have\n\
+been made.  You must override this hook to provide dimensions larger\n\
+than 1.",
+bool, (tree decl, int *dims, int fn_level, int *default_dims),
 default_goacc_validate_dims)
 
 DEFHOOK
@@ -1696,6 +1697,14 @@ or zero if unbounded.",
 int, (int axis),
 default_goacc_dim_limit)
 
+DEFHOOK
+(adjust_parallelism,
+"This hook allows the accelerator compiler to remove any unused\n\
+parallelism exposed in the current loop @var{THIS_MASK}, and the\n\
+enclosing loop @var{OUTER_MASK}.  It returns an adjusted mask.",
+unsigned, (unsigned this_mask, unsigned outer_mask),
+default_goacc_adjust_parallelism)
+
 DEFHOOK
 (fork_join,
 "This hook can be used to convert IFN_GOACC_FORK and IFN_GOACC_JOIN\n\
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index 18070df7839..b60c72a38f1 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -111,10 +111,11 @@ extern void default_finish_cost (void *, unsigned *, unsigned *, unsigned *);
 extern void default_destroy_cost_data (void *);
 
 /* OpenACC hooks.  */
-extern bool default_goacc_validate_dims (tree, int [], int);
+extern bool default_goacc_validate_dims (tree, int [], int, int []);
 extern int default_goacc_dim_limit (int);
 extern bool default_goacc_fork_join (gcall *, const int [], bool);
 extern void default_goacc_reduction (gcall *);
+extern unsigned default_goacc_adjust_parallelism (unsigned, unsigned);
 
 /* These are here, and not in hooks.[ch], because not all users of
    hooks.h include tm.h, and thus we don't have CUMULATIVE_ARGS.  */
-- 
2.14.3

[og7] vector_length extension part 4: target hooks and automatic parallelism

Reply via email to