RE: [RFC] Enabling SVE with offloading to nvptx

Prathamesh Kulkarni via Gcc Fri, 18 Oct 2024 06:48:37 -0700


> -----Original Message-----
> From: Richard Biener <rguent...@suse.de>
> Sent: 17 October 2024 19:18
> To: Prathamesh Kulkarni <prathame...@nvidia.com>
> Cc: gcc@gcc.gnu.org; Thomas Schwinge <tschwi...@baylibre.com>
> Subject: RE: [RFC] Enabling SVE with offloading to nvptx
> 
> External email: Use caution opening links or attachments
> 
> 
> On Thu, 17 Oct 2024, Prathamesh Kulkarni wrote:
> 
> > > -----Original Message-----
> > > From: Richard Biener <rguent...@suse.de>
> > > Sent: 16 October 2024 13:05
> > > To: Prathamesh Kulkarni <prathame...@nvidia.com>
> > > Cc: gcc@gcc.gnu.org; Thomas Schwinge <tschwi...@baylibre.com>
> > > Subject: Re: [RFC] Enabling SVE with offloading to nvptx
> > >
> > > External email: Use caution opening links or attachments
> > >
> > >
> > > On Tue, 15 Oct 2024, Prathamesh Kulkarni wrote:
> > >
> > > > Hi,
> > > > Testing libgomp with SVE enabled (-mcpu=generic+sve2), results
> in
> > > > ~60
> > > UNRESOLVED errors with following error message:
> > > >
> > > > lto1: fatal error: degree of 'poly_int' exceeds
> 'NUM_POLY_INT_COEFFS'
> > > > compilation terminated.
> > > > nvptx mkoffload: fatal error:
> > > > ../../install/bin/aarch64-unknown-linux-gnu-accel-nvptx-none-gcc
> > > returned 1 exit status compilation terminated.
> > > >
> > > > This behaviour can be reproduced with the following simple
> > > > test-case
> > > with -fopenmp -foffload=nvptx-none -mcpu=generic+sve2:
> > > >
> > > > #define N 1000
> > > > int main ()
> > > > {
> > > >   int i;
> > > >   int A[N] = {0}, B[N] = {0};
> > > >
> > > >   #pragma omp target map(i), map(tofrom: A), map(from: B)
> > > >   #pragma omp simd
> > > >   for (i = 0; i < N; i++)
> > > >     A[i] = A[i] + B[i];
> > > >   return A[0];
> > > > }
> > > >
> > > > omplower pass lowers the above loop to the following:
> > > >
> > > >                 D.4576 = .GOMP_USE_SIMT ();
> > > >                 if (D.4576 != 0) goto <D.4577>; else goto
> <D.4578>;
> > > >                 <D.4577>:
> > > >                 {
> > > >                   unsigned int D.4586;
> > > >                   unsigned int D.4587;
> > > >                   int D.4588;
> > > >                   void * simduid.5;
> > > >                   void * .omp_simt.6;
> > > >                   int D.4596;
> > > >                   _Bool D.4597;
> > > >                   int D.4598;
> > > >                   unsigned int D.4599;
> > > >                   int D.4600;
> > > >                   int D.4601;
> > > >                   int * D.4602;
> > > >                   int i [value-expr: D.4588];
> > > >                   int i.0;
> > > >
> > > >                   simduid.5 = .GOMP_SIMT_ENTER (simduid.5,
> &D.4588);
> > > >                   .omp_simt.6 = .GOMP_SIMT_ENTER_ALLOC
> (simduid.5);
> > > >                   D.4587 = 0;
> > > >                   i.0 = 0;
> > > >                   #pragma omp simd safelen(32)
> > > > _simduid_(simduid.5)
> > > _simt_ linear(i.0:1) linear(i:1)
> > > >                   for (i.0 = 0; i.0 < 1000; i.0 = i.0 + 1)
> > > >                   ...
> > > >                 }
> > > >                 goto <D.4579>;
> > > >                 <D.4578>:
> > > >                 {
> > > >                   unsigned int D.4603;
> > > >                   unsigned int D.4604;
> > > >                   int D.4605[0:POLY_INT_CST [15, 16]];
> > > >                   void * simduid.7;
> > > >                   unsigned int D.4612;
> > > >                   int * D.4613;
> > > >                   int D.4614;
> > > >                   int i [value-expr: D.4605[D.4604]];
> > > >                   int i.0;
> > > >
> > > >                   D.4604 = 0;
> > > >                   i.0 = 0;
> > > >                   #pragma omp simd safelen(POLY_INT_CST [16,
> 16])
> > > _simduid_(simduid.7) linear(i.0:1) linear(i:1)
> > > >                   ...
> > > >                  }
> > > >                  <D.4579>:
> > > >                  ...
> > > >
> > > > For offloading to SIMT based device like nvptx, scan_omp_simd
> > > > duplicates lowering of simd pragma into if-else where the if-
> part
> > > > contains simt code-path, and else-part contains simd code-path.
> In
> > > lower_rec_simd_input_clauses, max_vf is set to 16+16x for the
> above
> > > case as determined by omp_max_vf, and that becomes length of the
> omp
> > > simd
> > > array:
> > > > int D.4605[0:POLY_INT_CST [15, 16]];
> > > >
> > > > The issue here is that, the function containing above if-else
> > > > condition gets streamed out to LTO bytecode including the simd
> > > > code-
> > > path and the omp simd array, whose domain is [0:POLY_INT_CST[15,
> > > 16]], and thus we get the above error while streaming-in
> > > POLY_INT_CST in lto_input_ts_poly_tree_pointers on device side.
> > > >
> > > > Note that, the simd code-path is essentially dead-code on nvptx,
> > > > since
> > > > .GOMP_USE_SIMT() resolves to 1 during omp_device_lower pass, and
> > > > later optimization passes (ccp2) remove the dead-code path and
> > > > unused omp
> > > simd arrays while compiling to device. So in this case, we aren't
> > > really mapping POLY_INT_CST from host to device, but it gets
> > > streamed out to device as an artefact of omp simd lowering.
> > > >
> > > > I suppose a proper fix here would be to (somehow) defer lowering
> > > > of omp pragma simd after streaming out to device, so the device
> > > > only sees simt code-path, and the host only sees simd code path
> ?
> > > > Or perhaps
> > > clone each function in offload region, one for host and one for
> SIMT
> > > device, and only stream the device versions to avoid streaming out
> > > host- specific IR changes ?
> > >
> > > There is currently no way to have the host compiler query offload
> > > target capabilities so the only true fix is to delay OMP SIMD
> > > lowering to the target.
> > Um, I thought we could use omp_max_simt_vf from host to query if the
> offload target is SIMT ?
> > The function essentially iterates thru OFFLOAD_TARGET_NAMES and
> returns non-zero for nvptx.
> > >
> > > Are we dependent on the early optimization pipeline being run on
> the
> > > host to produce the offload IL?  There's some oddball OACC passes
> in
> > > pass_ipa_oacc.
> > >
> > > That said, I'd probably try to produce clones with unlowered IL
> and
> > > skip those clones from all processing from that point and resume
> in
> > > the offload compiler.
> > >
> > > > I thought of following approaches as workarounds:
> > >
> > > I don't think any workaround will fly in the end.  Can't you
> simply
> > > force SVE to be off for offload clones on the host side and force
> > > OMP lowering with ADVSIMD only?
> > Would it be correct to set:
> > sctx.max_vf = constant_lower_bound (omp_max_vf ())
> >
> > if function is offloaded and omp_max_vf returns non-constant
> poly_int,
> > to force max_vf to be VLS, which will avoid VLA vectorization as in
> the attached patch ?
> >
> > Or should we modify autovectorize_vector_modes hook to return VLS
> > modes for offloaded functions ?
> 
> Can't you simply put a target(march=armv8.3a) (or even more basic ISA)
> on the OMP target clones?  Or is the lowering happening before
> outlining?
AFAIU, scan_omp_target in omp-lower creates a "child function" but doesn't 
outline it.
The actual outlining seems to be happening during omp-expand pass.


> In that case maybe we want to switch the host target into "offload-
> lowering-mode" (effectively switching to a basic ISA)?
Sorry, I didn't understand -- Do you mean we should enforce basic ISA for all 
the functions including host-only ones ?

Since the issue stems from having host details leaked into IL during 
omp-lowering (and omp-expand), I was wondering if
we should use default versions of the hooks if offloading is enabled in 
omp-lower and omp-expand, which should avoid host details
creeping into offload IL, sth like in the attached patch ?

Thanks,
Prathamesh
> 
> > Thanks,
> > Prathamesh
> > >
> > > Richard.
> > >
> > > > [1] Set sctx.max_vf to constant_lower_bound(omp_max_vf ()) in
> > > > lower_rec_simd_input_clauses, if the function is going to be
> > > > offloaded and omp_max_vf returns non-constant poly_int. For
> above
> > > > case, it sets
> > > max_vf to 16 instead of 16+16x which seems to resolve the issue,
> but
> > > it'd use suboptimal max VF for host ? This is done in patch p-283-
> 2.txt.
> > > >
> > > > However, with clean trunk it still seems to use max_vf = 16
> after
> > > disabling the above error.
> > > > vect dump shows:
> > > >
> > > > (compute_affine_dependence
> > > >   ref_a: (*_25)[i.0_51], stmt_a: _26 = (*_25)[i.0_51];
> > > >   ref_b: (*_23)[i.0_51], stmt_b: (*_23)[i.0_51] = _27;
> > > > ) -> dependence analysis failed
> > > > foo.c:10:13: note:   dependence distance  = 0.
> > > > foo.c:10:13: note:   dependence distance == 0 between
> (*_23)[i.0_51]
> > > and (*_23)[i.0_51]
> > > > foo.c:10:13: missed:  bad data dependence.
> > > > foo.c:10:13: note:  ***** Analysis failed with vector mode
> VNx4SI
> > > >
> > > > This seems to happen because, loop->safelen is set to 16 by
> taking
> > > > MIN(constant_lower_bound(16+16x), INT_MAX) in expand_omp_simd:
> > > >
> > > >       if (!poly_int_tree_p (safelen, &val))
> > > >         safelen_int = 0;
> > > >       else
> > > >         safelen_int = MIN (constant_lower_bound (val), INT_MAX);
> > > >
> > > > and fails to vectorize with VLA vectors, because max_vf == 16
> and
> > > min_vf == 4+4x resulting in bad data dependence due to:
> > > >
> > > >   if (max_vf != MAX_VECTORIZATION_FACTOR
> > > >       && maybe_lt (max_vf, min_vf))
> > > >     return opt_result::failure_at (vect_location, "bad data
> > > > dependence.\n");
> > > >
> > > > If safelen was (somehow) set to 16+16x, I guess it could have
> used
> > > VF=4+4x and vectorized with VLA vectors.
> > > > but I suppose that's a separate issue ?
> > > >
> > > > [2] Since the issue seems to be only with streaming out length
> of
> > > > omp simd array when it's POLY_INT_CST, could we perhaps use a
> > > > place holder length during omp lowering and compute the correct
> > > > length after streaming out, so POLY_INT_CST doesn't get leaked
> > > > into bytecode ? The
> > > attached patch p-283-3.txt follows this approach by using bogus
> > > length INT_MAX in lower_rec_simd_input_clauses if offloading to
> SIMT
> > > device and max_vf is non-constant poly_int, and later computing
> the
> > > correct length in beginning of vect pass by setting it to
> omp_max_vf
> > > (), but I am not sure if this is entirely correct.
> > > > I am assuming that creating omp simd array of bogus length will
> > > > not be an issue for nvptx since it will never get referenced and
> > > > eventually be removed by remove_unused_locals ? If it'd not be a
> > > > good idea to
> > > rely on the pass pipeline to eliminate simd code-path and omp simd
> > > array while compiling to device, it could be possibly done during
> > > omp_lower_device pass itself ?
> > > >
> > > > [3] While streaming-in POLY_INT_CST, avoid emitting error
> > > > immediately if degree of POLY_INT_CST exceeds accel's
> > > > NUM_POLY_INT_COEFFS to ignore POLY_INT_CSTs that may potentially
> > > > occur on dead-code path, and instead mark it as error_mark_node.
> > > > For the above case, since
> > > POLY_INT_CST appears on dead-code path, streaming POLY_INT_CST
> with
> > > higher degree than accel's NUM_POLY_INT_COEFFS would be
> "harmless".
> > > And detect invalid POLY_INT_CST's in expand pass (if it survives
> > > till this point), and emit above error, but not sure if that'd be
> > > the right place ?
> > > > This is done in p-283-4.txt.
> > > >
> > > > All the three patches fix UNRESOLVED tests due to POLY_INT_CST
> > > streaming error in libgomp testsuite with -mcpu=generic+sve2.
> > > > (Altho it introduces a strange FAIL for data-5.f90, which I am
> > > investigating).
> > > > I would be grateful for suggestions on how to proceed.
> > > >
> > > > Signed-off-by: Prathamesh Kulkarni <prathame...@nvidia.com>
> > > >
> > > > Thanks,
> > > > Prathamesh
> > > >
> > >
> > > --
> > > Richard Biener <rguent...@suse.de>
> > > SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461
> > > Nuernberg, Germany;
> > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG
> > > Nuernberg)
> >
> 
> --
> Richard Biener <rguent...@suse.de>
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG
> Nuernberg)

Use default versions of target hooks during omplower and ompexpand when 
offloading.

gcc/
        * omp-expand.cc (omp_adjust_chunk_size ()): Call
        omp_maybe_offloaded and pass it's value to omp_max_vf.
        * omp-general.cc: Include targhooks.h.
        (omp_max_vf): New parameter offload.
        Call default_autovectorize_vector_modes and
        default_preferred_simd_mode if offload is true.
        (omp_maybe_offloaded): Export.
        * omp-general.h (omp_max_vf): New parameter offload
        with default value false.
        (omp_maybe_offloaded): Declare.
        * omp-low.cc: Include targhooks.h.
        (omp_clause_aligned_alignment): Call default_autovectorize_vector_modes
        and default_preferred_simd_mode if offload is true.
        (lower_rec_simd_input_clauses): Call omp_maybe_offloaded_ctx and pass
        it's value to omp_max_vf.
        (lower_rec_input_clauses): Call omp_maybe_offloaded_ctx and pass it's
        value to omp_clause_aligned_alignment.

Signed-off-by: Prathamesh Kulkarni <prathame...@nvidia.com>
diff --git a/gcc/omp-expand.cc b/gcc/omp-expand.cc
index 9ff9553c3ea..37ba36fb9b1 100644
--- a/gcc/omp-expand.cc
+++ b/gcc/omp-expand.cc
@@ -211,7 +211,7 @@ omp_adjust_chunk_size (tree chunk_size, bool simd_schedule)
   if (!simd_schedule || integer_zerop (chunk_size))
     return chunk_size;
 
-  poly_uint64 vf = omp_max_vf ();
+  poly_uint64 vf = omp_max_vf (omp_maybe_offloaded ());
   if (known_eq (vf, 1U))
     return chunk_size;
 
diff --git a/gcc/omp-general.cc b/gcc/omp-general.cc
index f4c5f577047..7a03fdcaf49 100644
--- a/gcc/omp-general.cc
+++ b/gcc/omp-general.cc
@@ -44,6 +44,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "streamer-hooks.h"
 #include "opts.h"
 #include "tree-pretty-print.h"
+#include "targhooks.h"
 
 enum omp_requires omp_requires_mask;
 
@@ -989,7 +990,7 @@ find_combined_omp_for (tree *tp, int *walk_subtrees, void 
*data)
 /* Return maximum possible vectorization factor for the target.  */
 
 poly_uint64
-omp_max_vf (void)
+omp_max_vf (bool offload)
 {
   if (!optimize
       || optimize_debug
@@ -999,7 +1000,10 @@ omp_max_vf (void)
     return 1;
 
   auto_vector_modes modes;
-  targetm.vectorize.autovectorize_vector_modes (&modes, true);
+  if (offload)
+    default_autovectorize_vector_modes (&modes, true);
+  else
+    targetm.vectorize.autovectorize_vector_modes (&modes, true);
   if (!modes.is_empty ())
     {
       poly_uint64 vf = 0;
@@ -1011,7 +1015,9 @@ omp_max_vf (void)
       return vf;
     }
 
-  machine_mode vqimode = targetm.vectorize.preferred_simd_mode (QImode);
+  machine_mode vqimode
+    = (offload) ? default_preferred_simd_mode (QImode)
+               : targetm.vectorize.preferred_simd_mode (QImode);
   if (GET_MODE_CLASS (vqimode) == MODE_VECTOR_INT)
     return GET_MODE_NUNITS (vqimode);
 
@@ -1107,7 +1113,7 @@ omp_offload_device_kind_arch_isa (const char *props, 
const char *prop)
    Return true in declare target functions, or when nested in a target
    region or when unsure, return false otherwise.  */
 
-static bool
+bool
 omp_maybe_offloaded (void)
 {
   if (!ENABLE_OFFLOADING)
diff --git a/gcc/omp-general.h b/gcc/omp-general.h
index 891f467556e..6cf0829d13b 100644
--- a/gcc/omp-general.h
+++ b/gcc/omp-general.h
@@ -162,7 +162,7 @@ extern void omp_extract_for_data (gomp_for *for_stmt, 
struct omp_for_data *fd,
                                  struct omp_for_data_loop *loops);
 extern gimple *omp_build_barrier (tree lhs);
 extern tree find_combined_omp_for (tree *, int *, void *);
-extern poly_uint64 omp_max_vf (void);
+extern poly_uint64 omp_max_vf (bool offload=false);
 extern int omp_max_simt_vf (void);
 extern const char *omp_context_name_list_prop (tree);
 extern void omp_construct_traits_to_codes (tree, int, enum tree_code *);
@@ -186,6 +186,7 @@ extern tree oacc_get_fn_attrib (tree fn);
 extern bool offloading_function_p (tree fn);
 extern int oacc_get_fn_dim_size (tree fn, int axis);
 extern int oacc_get_ifn_dim_arg (const gimple *stmt);
+extern bool omp_maybe_offloaded (void);
 
 enum omp_requires {
   OMP_REQUIRES_ATOMIC_DEFAULT_MEM_ORDER = 0xf,
diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc
index da2051b0279..2c09bcebee1 100644
--- a/gcc/omp-low.cc
+++ b/gcc/omp-low.cc
@@ -60,6 +60,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "stringpool.h"
 #include "attribs.h"
 #include "omp-offload.h"
+#include "targhooks.h"
 
 /* Lowering of OMP parallel and workshare constructs proceeds in two
    phases.  The first phase scans the function looking for OMP statements
@@ -4523,7 +4524,7 @@ omp_reduction_init (tree clause, tree type)
    OMP_CLAUSE_ALIGNED.  */
 
 static tree
-omp_clause_aligned_alignment (tree clause)
+omp_clause_aligned_alignment (tree clause, bool offload)
 {
   if (OMP_CLAUSE_ALIGNED_ALIGNMENT (clause))
     return OMP_CLAUSE_ALIGNED_ALIGNMENT (clause);
@@ -4532,7 +4533,11 @@ omp_clause_aligned_alignment (tree clause)
   unsigned int al = 1;
   opt_scalar_mode mode_iter;
   auto_vector_modes modes;
-  targetm.vectorize.autovectorize_vector_modes (&modes, true);
+  if (offload)
+    default_autovectorize_vector_modes (&modes, true);
+  else
+    targetm.vectorize.autovectorize_vector_modes (&modes, true);
+
   static enum mode_class classes[]
     = { MODE_INT, MODE_VECTOR_INT, MODE_FLOAT, MODE_VECTOR_FLOAT };
   for (int i = 0; i < 4; i += 2)
@@ -4540,7 +4545,9 @@ omp_clause_aligned_alignment (tree clause)
     FOR_EACH_MODE_IN_CLASS (mode_iter, classes[i])
       {
        scalar_mode mode = mode_iter.require ();
-       machine_mode vmode = targetm.vectorize.preferred_simd_mode (mode);
+       machine_mode vmode
+         = (offload) ? default_preferred_simd_mode (mode)
+                     : targetm.vectorize.preferred_simd_mode (mode);
        if (GET_MODE_CLASS (vmode) != classes[i + 1])
          continue;
        machine_mode alt_vmode;
@@ -4588,7 +4595,8 @@ lower_rec_simd_input_clauses (tree new_var, omp_context 
*ctx,
 {
   if (known_eq (sctx->max_vf, 0U))
     {
-      sctx->max_vf = sctx->is_simt ? omp_max_simt_vf () : omp_max_vf ();
+      bool offload = omp_maybe_offloaded_ctx (ctx);
+      sctx->max_vf = sctx->is_simt ? omp_max_simt_vf () : omp_max_vf (offload);
       if (maybe_gt (sctx->max_vf, 1U))
        {
          tree c = omp_find_clause (gimple_omp_for_clauses (ctx->stmt),
@@ -5106,7 +5114,8 @@ lower_rec_input_clauses (tree clauses, gimple_seq *ilist, 
gimple_seq *dlist,
                  if (new_var == NULL_TREE)
                    new_var = maybe_lookup_decl_in_outer_ctx (var, ctx);
                  x = builtin_decl_explicit (BUILT_IN_ASSUME_ALIGNED);
-                 tree alarg = omp_clause_aligned_alignment (c);
+                 bool offload = omp_maybe_offloaded_ctx (ctx);
+                 tree alarg = omp_clause_aligned_alignment (c, offload);
                  alarg = fold_convert_loc (clause_loc, size_type_node, alarg);
                  x = build_call_expr_loc (clause_loc, x, 2, new_var, alarg);
                  x = fold_convert_loc (clause_loc, TREE_TYPE (new_var), x);
@@ -5121,7 +5130,8 @@ lower_rec_input_clauses (tree clauses, gimple_seq *ilist, 
gimple_seq *dlist,
                  t = maybe_lookup_decl_in_outer_ctx (var, ctx);
                  t = build_fold_addr_expr_loc (clause_loc, t);
                  t2 = builtin_decl_explicit (BUILT_IN_ASSUME_ALIGNED);
-                 tree alarg = omp_clause_aligned_alignment (c);
+                 bool offload = omp_maybe_offloaded_ctx (ctx);
+                 tree alarg = omp_clause_aligned_alignment (c, offload);
                  alarg = fold_convert_loc (clause_loc, size_type_node, alarg);
                  t = build_call_expr_loc (clause_loc, t2, 2, t, alarg);
                  t = fold_convert_loc (clause_loc, ptype, t);

RE: [RFC] Enabling SVE with offloading to nvptx

Reply via email to