> -----Original Message-----
> From: Gcc <gcc-bounces~prathameshk=nvidia....@gcc.gnu.org> On Behalf
> Of Prathamesh Kulkarni via Gcc
> Sent: 27 December 2024 18:00
> To: Jakub Jelinek <ja...@redhat.com>
> Cc: Andrew Stubbs <a...@baylibre.com>; Richard Biener
> <richard.guent...@gmail.com>; Richard Biener <rguent...@suse.de>;
> gcc@gcc.gnu.org; Thomas Schwinge <tschwi...@baylibre.com>
> Subject: RE: [RFC] Enabling SVE with offloading to nvptx
>
> External email: Use caution opening links or attachments
>
>
> > -----Original Message-----
> > From: Jakub Jelinek <ja...@redhat.com>
> > Sent: 17 December 2024 19:09
> > To: Prathamesh Kulkarni <prathame...@nvidia.com>
> > Cc: Andrew Stubbs <a...@baylibre.com>; Richard Biener
> > <richard.guent...@gmail.com>; Richard Biener <rguent...@suse.de>;
> > gcc@gcc.gnu.org; Thomas Schwinge <tschwi...@baylibre.com>
> > Subject: Re: [RFC] Enabling SVE with offloading to nvptx
> >
> >
> > On Mon, Dec 02, 2024 at 11:17:08AM +0000, Prathamesh Kulkarni wrote:
> > > --- a/gcc/cfgloop.h
> > > +++ b/gcc/cfgloop.h
> > > @@ -233,6 +233,12 @@ public:
> > >      flag_finite_loops or similar pragmas state.  */
> > >   unsigned finite_p : 1;
> > >
> > > + /* True if SIMD loop needs delayed lowering of artefacts like
> > > +    safelen and length of omp simd arrays that depend on target's
> > > +    max_vf.  This is true for offloading, when max_vf is computed
> > > +    after streaming out to device.  */
> > > + unsigned needs_max_vf_lowering: 1;
> >
> > Consistency, finite_p above uses space before :, the above line
> > doesn't.
> >
> > > --- a/gcc/omp-expand.cc
> > > +++ b/gcc/omp-expand.cc
> > > @@ -7170,6 +7170,10 @@ expand_omp_simd (struct omp_region *region, struct omp_for_data *fd)
> > >       loop->latch = cont_bb;
> > >       add_loop (loop, l1_bb->loop_father);
> > >       loop->safelen = safelen_int;
> > > +     loop->needs_max_vf_lowering = is_in_offload_region (region);
> > > +     if (loop->needs_max_vf_lowering)
> > > +       cfun->curr_properties &= ~PROP_gimple_lomp_dev;
> >
> > Do you really need this for non-SVE arches?
> > I mean, could you not set loop->needs_max_vf_lowering if maximum
> > number of poly_int coeffs is 1?  Or if omp_max_vf returns constant or
> > something similar?
> Well, I guess the issue is not really about VLA vectors, but about host
> and device having different max_vf.  Selecting the optimal max_vf is not
> really possible during omp-low/omp-expand, since we don't have the
> device's target info available at that point.  Andrew's recent patch
> works around this limitation by searching for "amdgcn" in
> OFFLOAD_TARGET_NAMES in omp_max_vf, but I guess a more general
> solution would be to delay lowering max_vf until after streaming-out to
> the device, irrespective of VLA/VLS vectors?
> For AArch64/nvptx offloading with SVE, where the host is VLA and the
> device is VLS, the issue is more pronounced (failing to compile) than
> when offloading from a VLS host to a VLS device (selecting a
> sub-optimal max_vf).
> >
> > > --- a/gcc/omp-offload.cc
> > > +++ b/gcc/omp-offload.cc
> > > @@ -2617,6 +2617,77 @@ find_simtpriv_var_op (tree *tp, int *walk_subtrees, void *)
> > >    return NULL_TREE;
> > >  }
> > >
> > > +/* Compute max_vf for target, and accordingly set loop->safelen and
> > > +   length of omp simd arrays.  */
> > > +
> > > +static void
> > > +adjust_max_vf (function *fun)
> > > +{
> > > + if (!fun->has_simduid_loops)
> > > + return;
> > > +
> > > + poly_uint64 max_vf = omp_max_vf (false);
> > > +
> > > +  /* Since loop->safelen has to be an integer, it's not always possible
> > > +     to compare against poly_int.  For eg 32 and 16+16x are not
> > > +     comparable at compile-time because 16+16x <= 32 for x < 2, but
> > > +     16+16x > 32 for x >= 2.  Even if we could get runtime VL based
> > > +     on -mcpu/-march, that would not be portable across other SVE
> > > +     archs.
> > > +
> > > +     For now, use constant_lower_bound (max_vf), as a "safer
> > > +     approximation" to max_vf that avoids these issues, with the
> > > +     downside that it will be suboptimal max_vf for SVE archs
> > > +     implementing SIMD width > 128 bits.  */
> > > +
> > > +  uint64_t max_vf_int;
> > > +  if (!max_vf.is_constant (&max_vf_int))
> > > +    max_vf_int = constant_lower_bound (max_vf);
> > > +
> > > +  calculate_dominance_info (CDI_DOMINATORS);
> > > +  for (auto loop: loops_list (fun, 0))
> > > +    {
> > > +      if (!loop->needs_max_vf_lowering)
> > > +        continue;
> > > +
> > > +      if (loop->safelen > max_vf_int)
> > > +        loop->safelen = max_vf_int;
> > > +
> > > +      basic_block *bbs = get_loop_body (loop);
> >
> > I still think using the tree-vectorizer.cc infrastructure is much
> > better here.
> > There is no guarantee all accesses to the simd arrays will be within
> > the loop body, in fact, none of them could be there.  Consider e.g.
> > parts of loop body (in the C meaning) followed by noreturn calls,
> > those aren't considered loop body in the cfg.
> > So, I think it is much better to walk the whole function once, not for
> > each loop walk its loop body (that could be even more expensive if
> > there are nested needs_max_vf_lowering loops).
> > And note_simd_array_uses has it all implemented.
> > So, if you drop the above calculate_dominance_info (CDI_DOMINATORS)
> > and, when seeing the first needs_max_vf_lowering loop, perform
> > note_simd_array_uses (just once per function), then for the simduid ->
> > loop mapping you actually only care which simduid corresponds to a
> > needs_max_vf_lowering loop and which doesn't, so say a hash set
> > indexed by simduid would be good enough.
> Thanks for the suggestions. The attached patch uses
> note_simd_array_uses to build omp simd array -> simduid mapping, and
> creates a new mapping from simduid -> safelen, and uses it to adjust
> length of omp simd array.
> Does it look OK?
>
> While working on the patch, I stumbled on an issue with
> note_simd_array_uses, which I have filed as PR118200, and posted patch
> for it:
> https://gcc.gnu.org/pipermail/gcc-patches/2024-December/672269.html
>
> Patch (combined with PR118200 patch) passes libgomp testing with
> AArch64/nvptx offloading (with/without SVE) and bootstrap+test on
> aarch64-linux-gnu.
Hi, ping: https://gcc.gnu.org/pipermail/gcc/2024-December/245284.html
Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
> >
> >
> > > +      for (unsigned i = 0; i < loop->num_nodes; i++)
> > > +        for (auto gsi = gsi_start_bb (bbs[i]); !gsi_end_p (gsi);
> > > +             gsi_next (&gsi))
> > > +          {
> > > +            gcall *stmt = dyn_cast<gcall *> (gsi_stmt (gsi));
> > > +            tree lhs = NULL_TREE;
> > > +
> > > +            if (stmt
> > > +                && gimple_call_internal_p (stmt)
> > > +                && (gimple_call_internal_fn (stmt) == IFN_GOMP_SIMD_LANE)
> > > +                && ((lhs = gimple_call_lhs (stmt)) != NULL_TREE))
> > > +              {
> > > +                imm_use_iterator use_iter;
> > > +                gimple *use_stmt;
> > > +
> > > +                FOR_EACH_IMM_USE_STMT (use_stmt, use_iter, lhs)
> > > +                  {
> > > +                    gassign *assign_stmt = dyn_cast<gassign *> (use_stmt);
> > > +                    if (!assign_stmt)
> > > +                      continue;
> > > +
> > > +                    tree assign_stmt_lhs = gimple_assign_lhs (assign_stmt);
> > > +                    if (TREE_CODE (assign_stmt_lhs) == ARRAY_REF)
> > > +                      {
> > > +                        tree array = TREE_OPERAND (assign_stmt_lhs, 0);
> > > +                        tree elt_type = TREE_TYPE (TREE_TYPE (array));
> > > +                        tree atype = build_array_type_nelts (elt_type,
> > > +                                                             loop->safelen);
> > > +                        TREE_TYPE (array) = atype;
> > > +                        relayout_decl (array);
> >
> > Apart from the possibility that the .GOMP_SIMD_LANE call will end up
> > not in the loop body, this also has the disadvantage that if there are
> > say 1000 accesses to some simd array, you change the VAR_DECL (new
> > array type, relayout_decl) 1000 times.
> > By just traversing the hash table, each one that needs to be modified
> > will be touched just once.
> >
> > Jakub