> -----Original Message-----
> From: Jakub Jelinek <ja...@redhat.com>
> Sent: 17 December 2024 19:09
> To: Prathamesh Kulkarni <prathame...@nvidia.com>
> Cc: Andrew Stubbs <a...@baylibre.com>; Richard Biener
> <richard.guent...@gmail.com>; Richard Biener <rguent...@suse.de>;
> gcc@gcc.gnu.org; Thomas Schwinge <tschwi...@baylibre.com>
> Subject: Re: [RFC] Enabling SVE with offloading to nvptx
>
> On Mon, Dec 02, 2024 at 11:17:08AM +0000, Prathamesh Kulkarni wrote:
> > --- a/gcc/cfgloop.h
> > +++ b/gcc/cfgloop.h
> > @@ -233,6 +233,12 @@ public:
> >       flag_finite_loops or similar pragmas state.  */
> >    unsigned finite_p : 1;
> >
> > +  /* True if SIMD loop needs delayed lowering of artefacts like
> > +     safelen and length of omp simd arrays that depend on target's
> > +     max_vf.  This is true for offloading, when max_vf is computed after
> > +     streaming out to device.  */
> > +  unsigned needs_max_vf_lowering: 1;
>
> Consistency, finite_p above uses space before :, the above line
> doesn't.
>
> > --- a/gcc/omp-expand.cc
> > +++ b/gcc/omp-expand.cc
> > @@ -7170,6 +7170,10 @@ expand_omp_simd (struct omp_region *region, struct omp_for_data *fd)
> >    loop->latch = cont_bb;
> >    add_loop (loop, l1_bb->loop_father);
> >    loop->safelen = safelen_int;
> > +  loop->needs_max_vf_lowering = is_in_offload_region (region);
> > +  if (loop->needs_max_vf_lowering)
> > +    cfun->curr_properties &= ~PROP_gimple_lomp_dev;
>
> Do you really need this for non-SVE arches?
> I mean, could you not set loop->needs_max_vf_lowering if maximum
> number of poly_int coeffs is 1?  Or if omp_max_vf returns constant or
> something similar?

Well, I guess the issue is not really about VLA vectors, but about host
and device having different max_vf: selecting the optimal max_vf is not
really possible during omp-low/omp-expand, since we don't have the
device's target info available at that point. Andrew's recent patch works
around this limitation by searching for "amdgcn" in OFFLOAD_TARGET_NAMES
in omp_max_vf, but I guess a more general solution would be to delay
lowering max_vf until after streaming out to the device, irrespective of
VLA/VLS vectors?

For AArch64/nvptx offloading with SVE, where the host is VLA and the
device is VLS, the issue is more pronounced (failure to compile),
compared to offloading from a VLS host to a VLS device (selection of a
sub-optimal max_vf).
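For concreteness, here is a minimal sketch of the kind of construct
affected (a hypothetical example, not taken from the thread): a simd loop
with a privatized variable inside a target region, where both safelen and
the length of the omp simd array backing the reduction depend on max_vf:

  /* Hypothetical reproducer: with an SVE host and an nvptx device, the
     omp simd array created for 'sum' would get a poly_int length derived
     from the host's max_vf, which cannot be streamed out to the VLS
     device.  */
  int
  f (int *a, int n)
  {
    int sum = 0;
  #pragma omp target map(to: a[0:n]) map(tofrom: sum)
  #pragma omp simd reduction(+: sum)
    for (int i = 0; i < n; i++)
      sum += a[i];
    return sum;
  }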
> > --- a/gcc/omp-offload.cc
> > +++ b/gcc/omp-offload.cc
> > @@ -2617,6 +2617,77 @@ find_simtpriv_var_op (tree *tp, int *walk_subtrees, void *)
> >    return NULL_TREE;
> >  }
> >
> > +/* Compute max_vf for target, and accordingly set loop->safelen and length
> > +   of omp simd arrays.  */
> > +
> > +static void
> > +adjust_max_vf (function *fun)
> > +{
> > +  if (!fun->has_simduid_loops)
> > +    return;
> > +
> > +  poly_uint64 max_vf = omp_max_vf (false);
> > +
> > +  /* Since loop->safelen has to be an integer, it's not always possible
> > +     to compare against poly_int.  For eg 32 and 16+16x are not comparable at
> > +     compile-time because 16+16x <= 32 for x < 2, but 16+16x > 32 for x >= 2.
> > +     Even if we could get runtime VL based on -mcpu/-march, that would not be
> > +     portable across other SVE archs.
> > +
> > +     For now, use constant_lower_bound (max_vf), as a "safer approximation" to
> > +     max_vf that avoids these issues, with the downside that it will be
> > +     suboptimal max_vf for SVE archs implementing SIMD width > 128
> > +     bits.  */
> > +
> > +  uint64_t max_vf_int;
> > +  if (!max_vf.is_constant (&max_vf_int))
> > +    max_vf_int = constant_lower_bound (max_vf);
> > +
> > +  calculate_dominance_info (CDI_DOMINATORS);
> > +  for (auto loop: loops_list (fun, 0))
> > +    {
> > +      if (!loop->needs_max_vf_lowering)
> > +        continue;
> > +
> > +      if (loop->safelen > max_vf_int)
> > +        loop->safelen = max_vf_int;
> > +
> > +      basic_block *bbs = get_loop_body (loop);
>
> I still think using the tree-vectorizer.cc infrastructure is much
> better here.
> There is no guarantee all accesses to the simd arrays will be within
> the loop body, in fact, none of them could be there.  Consider e.g.
> parts of loop body (in the C meaning) followed by noreturn calls,
> those aren't considered loop body in the cfg.
> So, I think it is much better to walk the whole function once, not
> for each loop walk its loop body (that could be even more expensive
> if there are nested needs_max_vf_lowering loops).
> And note_simd_array_uses has it all implemented, building a mapping
> between the simd arrays and their simduids.
> So, if you drop the above calculate_dominance_info (CDI_DOMINATORS);
> and when seeing the first loop->needs_max_vf_lowering loop perform
> note_simd_array_uses (just once per function), for the simduid ->
> loop mapping you actually only care which simduid corresponds to a
> loop->needs_max_vf_lowering loop and which doesn't, so say a hash set
> indexed by simduid would be good enough.
Thanks for the suggestions. The attached patch uses note_simd_array_uses
to build an omp simd array -> simduid mapping, creates a new mapping from
simduid -> safelen, and uses that to adjust the lengths of the omp simd
arrays. Does it look OK?

While working on the patch, I stumbled on an issue with
note_simd_array_uses, which I have filed as PR118200, and posted a patch
for it:
https://gcc.gnu.org/pipermail/gcc-patches/2024-December/672269.html

The patch (combined with the PR118200 patch) passes libgomp testing with
AArch64/nvptx offloading (with and without SVE) and bootstrap+test on
aarch64-linux-gnu.

Thanks,
Prathamesh

> > +      for (unsigned i = 0; i < loop->num_nodes; i++)
> > +        for (auto gsi = gsi_start_bb (bbs[i]); !gsi_end_p (gsi); gsi_next (&gsi))
> > +          {
> > +            gcall *stmt = dyn_cast<gcall *> (gsi_stmt (gsi));
> > +            tree lhs = NULL_TREE;
> > +
> > +            if (stmt
> > +                && gimple_call_internal_p (stmt)
> > +                && (gimple_call_internal_fn (stmt) == IFN_GOMP_SIMD_LANE)
> > +                && ((lhs = gimple_call_lhs (stmt)) != NULL_TREE))
> > +              {
> > +                imm_use_iterator use_iter;
> > +                gimple *use_stmt;
> > +
> > +                FOR_EACH_IMM_USE_STMT (use_stmt, use_iter, lhs)
> > +                  {
> > +                    gassign *assign_stmt = dyn_cast<gassign *> (use_stmt);
> > +                    if (!assign_stmt)
> > +                      continue;
> > +
> > +                    tree assign_stmt_lhs = gimple_assign_lhs (assign_stmt);
> > +                    if (TREE_CODE (assign_stmt_lhs) == ARRAY_REF)
> > +                      {
> > +                        tree array = TREE_OPERAND (assign_stmt_lhs, 0);
> > +                        tree elt_type = TREE_TYPE (TREE_TYPE (array));
> > +                        tree atype = build_array_type_nelts (elt_type,
> > +                                                             loop->safelen);
> > +                        TREE_TYPE (array) = atype;
> > +                        relayout_decl (array);
>
> Apart from the possibility that the .GOMP_SIMD_LANE call will end up
> not in the loop body, this also has the disadvantage that if there
> are say 1000 accesses to some simd array, you change the VAR_DECL
> (new array type, relayout_decl) 1000 times.
> By just traversing the hash table, each one that needs to be modified
> will be touched just once.
>
>         Jakub
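In outline, the attached patch works as follows (a condensed sketch of
adjust_max_vf and adjust_simd_arrays_length from the diff below, not a
literal excerpt):

  /* Condensed sketch; see the patch below for the real code.  */
  poly_uint64 max_vf = omp_max_vf (false);
  int max_vf_int = max_vf.is_constant ()
                   ? max_vf.to_constant ()
                   : constant_lower_bound (max_vf);
  /* 1. Clamp safelen of every loop whose lowering was deferred.  */
  for (auto loop : loops_list (fun, 0))
    if (loop->needs_max_vf_lowering && loop->safelen > max_vf_int)
      loop->safelen = max_vf_int;
  /* 2. note_simd_array_uses maps each "omp simd array" to its simduid;
     a hash table keyed by simduid records the owning loop's safelen,
     and each array is then resized (build_array_type_nelts +
     relayout_decl) exactly once.  */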
Delay lowering safelen if offloading is enabled.

gcc/ChangeLog:

        * cfgloop.h (loop): New bitfield needs_max_vf_lowering.
        * omp-expand.cc (expand_omp_simd): Set loop->needs_max_vf_lowering
        to result of is_in_offload_region.
        * omp-low.cc (lower_rec_simd_input_clauses): Set a placeholder
        value for length of omp simd array if offloading is enabled and
        max_vf is non-constant poly_int.
        * tree-vectorizer.cc (simduid_to_safelen): New class.
        (adjust_simd_arrays_length): New function.
        * tree-vectorizer.h (adjust_simd_arrays_length): Declare.
        * omp-offload.cc: Include tree-vectorizer.h.
        (adjust_max_vf): New function.
        (execute_omp_device_lower): Call adjust_max_vf, and use
        constant_lower_bound on result of omp_max_vf while replacing call
        to .GOMP_MAX_VF.

Signed-off-by: Prathamesh Kulkarni <prathame...@nvidia.com>

diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index 30b5e40d0d9..f68e6317eed 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -233,6 +233,12 @@ public:
      flag_finite_loops or similar pragmas state.  */
   unsigned finite_p : 1;
 
+  /* True if SIMD loop needs delayed lowering of artefacts like
+     safelen and length of omp simd arrays that depend on target's
+     max_vf.  This is true for offloading, when max_vf is computed after
+     streaming out to device.  */
+  unsigned needs_max_vf_lowering : 1;
+
   /* The number of times to unroll the loop.  0 means no information given,
      just do what we always do.  A value of 1 means do not unroll the loop.
      A value of USHRT_MAX means unroll with no specific unrolling factor.
diff --git a/gcc/omp-expand.cc b/gcc/omp-expand.cc
index 8eb6f6f718c..33661c0f0f3 100644
--- a/gcc/omp-expand.cc
+++ b/gcc/omp-expand.cc
@@ -7170,6 +7170,10 @@ expand_omp_simd (struct omp_region *region, struct omp_for_data *fd)
   loop->latch = cont_bb;
   add_loop (loop, l1_bb->loop_father);
   loop->safelen = safelen_int;
+  loop->needs_max_vf_lowering = is_in_offload_region (region);
+  if (loop->needs_max_vf_lowering)
+    cfun->curr_properties &= ~PROP_gimple_lomp_dev;
+
   if (simduid)
     {
       loop->simduid = OMP_CLAUSE__SIMDUID__DECL (simduid);
diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc
index 33d81604cbe..503f32c7f57 100644
--- a/gcc/omp-low.cc
+++ b/gcc/omp-low.cc
@@ -4664,7 +4664,17 @@ lower_rec_simd_input_clauses (tree new_var, omp_context *ctx,
     }
   else
     {
-      tree atype = build_array_type_nelts (TREE_TYPE (new_var), sctx->max_vf);
+      poly_uint64 nelts;
+      /* If offloading is enabled, and max_vf is poly_int, avoid
+         using it as length of omp simd array, and use a placeholder value
+         instead to avoid streaming out poly_int to device.  The choice of
+         placeholder value (64) doesn't really matter since the arrays
+         will be set to correct length later in omp_device_lower pass.  */
+      if (omp_maybe_offloaded_ctx (ctx) && !sctx->max_vf.is_constant ())
+        nelts = 64;
+      else
+        nelts = sctx->max_vf;
+      tree atype = build_array_type_nelts (TREE_TYPE (new_var), nelts);
       tree avar = create_tmp_var_raw (atype);
       if (TREE_ADDRESSABLE (new_var))
         TREE_ADDRESSABLE (avar) = 1;
diff --git a/gcc/omp-offload.cc b/gcc/omp-offload.cc
index f16a2921b4b..31db42785da 100644
--- a/gcc/omp-offload.cc
+++ b/gcc/omp-offload.cc
@@ -55,6 +55,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "context.h"
 #include "convert.h"
 #include "opts.h"
+#include "tree-vectorizer.h"
 
 /* Describe the OpenACC looping structure of a function.  The entire
    function is held in a 'NULL' loop.  */
@@ -2617,6 +2618,44 @@ find_simtpriv_var_op (tree *tp, int *walk_subtrees, void *)
   return NULL_TREE;
 }
 
+/* Compute max_vf for target, and accordingly set loop->safelen and length
+   of omp simd arrays.  */
+
+static void
+adjust_max_vf (function *fun)
+{
+  if (!fun->has_simduid_loops)
+    return;
+
+  poly_uint64 max_vf = omp_max_vf (false);
+
+  /* Since loop->safelen has to be an integer, it's not always possible
+     to compare against poly_int.  For eg 32 and 16+16x are not comparable at
+     compile-time because 16+16x <= 32 for x < 2, but 16+16x > 32 for x >= 2.
+     Even if we could get runtime VL based on -mcpu/-march, that would not be
+     portable across other SVE archs.
+
+     For now, use constant_lower_bound (max_vf), as a "safer approximation" to
+     max_vf that avoids these issues, with the downside that it will be
+     suboptimal max_vf for SVE archs implementing SIMD width > 128 bits.  */
+
+  int max_vf_int = (max_vf.is_constant ())
+                    ? max_vf.to_constant ()
+                    : constant_lower_bound (max_vf);
+
+  bool seen_max_vf_lowering_loop = false;
+  for (auto loop: loops_list (fun, 0))
+    if (loop->needs_max_vf_lowering)
+      {
+        seen_max_vf_lowering_loop = true;
+        if (loop->safelen > max_vf_int)
+          loop->safelen = max_vf_int;
+      }
+
+  if (seen_max_vf_lowering_loop)
+    adjust_simd_arrays_length (fun);
+}
+
 /* Cleanup uses of SIMT placeholder internal functions: on non-SIMT targets,
    VF is 1 and LANE is 0; on SIMT targets, VF is folded to a constant, and
    LANE is kept to be expanded to RTL later on.  Also cleanup all other SIMT
@@ -2626,6 +2665,8 @@ find_simtpriv_var_op (tree *tp, int *walk_subtrees, void *)
 static unsigned int
 execute_omp_device_lower ()
 {
+  adjust_max_vf (cfun);
+
   int vf = targetm.simt.vf ? targetm.simt.vf () : 1;
   bool regimplify = false;
   basic_block bb;
@@ -2754,7 +2795,10 @@ execute_omp_device_lower ()
           rhs = build_int_cst (type, vf);
           break;
         case IFN_GOMP_MAX_VF:
-          rhs = build_int_cst (type, omp_max_vf (false));
+          /* Use constant_lower_bound, to keep max_vf consistent with
+             adjust_max_vf above.  */
+          rhs = build_int_cst (type,
+                               constant_lower_bound (omp_max_vf (false)));
           break;
         case IFN_GOMP_SIMT_ORDERED_PRED:
           rhs = vf == 1 ? integer_zero_node : NULL_TREE;
diff --git a/gcc/tree-vectorizer.cc b/gcc/tree-vectorizer.cc
index e83bb903f0d..143549a9831 100644
--- a/gcc/tree-vectorizer.cc
+++ b/gcc/tree-vectorizer.cc
@@ -465,6 +465,75 @@ shrink_simd_arrays
   delete simd_array_to_simduid_htab;
 }
 
+
+/* For mapping simduid to corresponding loop->safelen.  */
+class simduid_to_safelen : public free_ptr_hash<simduid_to_safelen>
+{
+public:
+  unsigned int simduid;
+  int safelen;
+
+  /* hash_table support.  */
+  static inline hashval_t hash (const simduid_to_safelen *);
+  static inline int equal (const simduid_to_safelen *,
+                           const simduid_to_safelen *);
+};
+
+inline hashval_t
+simduid_to_safelen::hash (const simduid_to_safelen *p)
+{
+  return p->simduid;
+}
+
+inline int
+simduid_to_safelen::equal (const simduid_to_safelen *p1,
+                           const simduid_to_safelen *p2)
+{
+  return p1->simduid == p2->simduid;
+}
+
+/* Adjust length of arrays with "omp simd array" attribute to corresponding
+   loop->safelen.  */
+
+void
+adjust_simd_arrays_length (function *fun)
+{
+  hash_table<simd_array_to_simduid> *htab = NULL;
+  note_simd_array_uses (&htab, fun);
+  if (!htab)
+    return;
+
+  hash_table<simduid_to_safelen> *simduid_to_safelen_tab
+    = new hash_table<simduid_to_safelen> (15);
+
+  for (auto loop: loops_list (fun, 0))
+    if (loop->needs_max_vf_lowering && loop->simduid)
+      {
+        simduid_to_safelen *entry = XNEW (simduid_to_safelen);
+        entry->simduid = DECL_UID (loop->simduid);
+        entry->safelen = loop->safelen;
+        *simduid_to_safelen_tab->find_slot (entry, INSERT) = entry;
+      }
+
+  for (auto it = htab->begin (); it != htab->end (); ++it)
+    {
+      tree array = (*it)->decl;
+      simduid_to_safelen data;
+      data.simduid = (*it)->simduid;
+      simduid_to_safelen *p = simduid_to_safelen_tab->find (&data);
+      if (p)
+        {
+          tree elt_type = TREE_TYPE (TREE_TYPE (array));
+          tree atype = build_array_type_nelts (elt_type, p->safelen);
+          TREE_TYPE (array) = atype;
+          relayout_decl (array);
+        }
+    }
+
+  delete htab;
+  delete simduid_to_safelen_tab;
+}
+
 /* Initialize the vec_info with kind KIND_IN and target cost data
    TARGET_COST_DATA_IN.  */
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 56fce666825..d48523af7be 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -2833,4 +2833,5 @@ vect_is_integer_truncation (stmt_vec_info stmt_info)
 /* Build a GIMPLE_ASSIGN or GIMPLE_CALL with the tree_code, or internal_fn
    contained in ch, respectively.  */
 gimple * vect_gimple_build (tree, code_helper, tree, tree = NULL_TREE);
+void adjust_simd_arrays_length (function *);
 #endif  /* GCC_TREE_VECTORIZER_H */
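For reference, the partial ordering described in the adjust_max_vf
comment can be written out with GCC's poly-int.h predicates. An
illustrative sketch only, assuming NUM_POLY_INT_COEFFS == 2 (as on
AArch64) so that 16+16x can be constructed directly:

  /* Illustrative only: the poly_int 16+16x vs. the constant 32.  */
  poly_uint64 max_vf (16, 16);           /* represents 16 + 16x, x >= 0  */
  gcc_assert (!known_gt (max_vf, 32U));  /* 16+16x <= 32 for x < 2   */
  gcc_assert (!known_le (max_vf, 32U));  /* 16+16x > 32 for x >= 2   */
  gcc_assert (maybe_gt (max_vf, 32U));   /* so neither order is known  */
  gcc_assert (constant_lower_bound (max_vf) == 16);  /* value at x = 0  */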