On Mon, Dec 02, 2024 at 11:17:08AM +0000, Prathamesh Kulkarni wrote:
> --- a/gcc/cfgloop.h
> +++ b/gcc/cfgloop.h
> @@ -233,6 +233,12 @@ public:
>       flag_finite_loops or similar pragmas state.  */
>    unsigned finite_p : 1;
>  
> +  /* True if SIMD loop needs delayed lowering of artefacts like
> +     safelen and length of omp simd arrays that depend on target's
> +     max_vf.  This is true for offloading, when max_vf is computed after
> +     streaming out to device.  */
> +  unsigned needs_max_vf_lowering: 1;

Consistency: finite_p above uses a space before the colon, the line above doesn't.
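I.e.:

  unsigned needs_max_vf_lowering : 1;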

> --- a/gcc/omp-expand.cc
> +++ b/gcc/omp-expand.cc
> @@ -7170,6 +7170,10 @@ expand_omp_simd (struct omp_region *region, struct omp_for_data *fd)
>        loop->latch = cont_bb;
>        add_loop (loop, l1_bb->loop_father);
>        loop->safelen = safelen_int;
> +      loop->needs_max_vf_lowering = is_in_offload_region (region);
> +      if (loop->needs_max_vf_lowering)
> +     cfun->curr_properties &= ~PROP_gimple_lomp_dev;

Do you really need this for non-SVE arches?
I mean, could you avoid setting loop->needs_max_vf_lowering when the maximum
number of poly_int coeffs is 1, or when omp_max_vf returns a constant, or
something similar?
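A minimal sketch of that gating, reusing the names from the quoted hunk
(purely illustrative, not the actual change; in particular, whether
omp_max_vf should be queried with the offload flag here is left open):

      loop->safelen = safelen_int;
      /* Only request delayed lowering when offloading and max_vf is not a
         compile-time constant (e.g. variable-length vector targets);
         constant-VF configurations keep the existing path.  */
      loop->needs_max_vf_lowering
        = (is_in_offload_region (region)
           && !omp_max_vf (false).is_constant ());
      if (loop->needs_max_vf_lowering)
        cfun->curr_properties &= ~PROP_gimple_lomp_dev;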

> --- a/gcc/omp-offload.cc
> +++ b/gcc/omp-offload.cc
> @@ -2617,6 +2617,77 @@ find_simtpriv_var_op (tree *tp, int *walk_subtrees, void *)
>    return NULL_TREE;
>  }
>  
> +/* Compute max_vf for target, and accordingly set loop->safelen and length
> +   of omp simd arrays.  */
> +
> +static void
> +adjust_max_vf (function *fun)
> +{
> +  if (!fun->has_simduid_loops)
> +    return;
> +
> +  poly_uint64 max_vf = omp_max_vf (false);
> +
> +  /* Since loop->safelen has to be an integer, it's not always possible
> +     to compare against poly_int.  For eg 32 and 16+16x are not comparable at
> +     compile-time because 16+16x <= 32 for x < 2, but 16+16x > 32 for x >= 2.
> +     Even if we could get runtime VL based on -mcpu/-march, that would not be
> +     portable across other SVE archs.
> +
> +     For now, use constant_lower_bound (max_vf), as a "safer approximation" to
> +     max_vf that avoids these issues, with the downside that it will be
> +     suboptimal max_vf for SVE archs implementing SIMD width > 128 bits.  */
> +
> +  uint64_t max_vf_int;
> +  if (!max_vf.is_constant (&max_vf_int))
> +    max_vf_int = constant_lower_bound (max_vf);
> +
> +  calculate_dominance_info (CDI_DOMINATORS);
> +  for (auto loop: loops_list (fun, 0))
> +    {
> +      if (!loop->needs_max_vf_lowering)
> +     continue;
> +
> +      if (loop->safelen > max_vf_int)
> +     loop->safelen = max_vf_int;
> +
> +      basic_block *bbs = get_loop_body (loop);

I still think using the tree-vectorizer.cc infrastructure is much better
here.
There is no guarantee all accesses to the simd arrays will be within the
loop body; in fact, it is possible that none of them are.  Consider e.g.
parts of the loop body (in the C meaning) followed by noreturn calls: those
blocks aren't considered part of the loop body in the CFG.
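For illustration (a made-up example, not from the patch), the simd-array
access below sits in a block that cannot reach the loop latch, so it is not
part of the CFG loop body:

  int x;
  #pragma omp simd private (x)
  for (int i = 0; i < n; i++)
    {
      x = a[i];
      if (x < 0)
        {
          b[0] = x;             /* Use of the privatized x, i.e. of its
                                   "omp simd array", in a block ending in a
                                   noreturn call.  */
          __builtin_abort ();
        }
      a[i] = x + 1;
    }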
So, I think it is much better to walk the whole function once, rather than
walking each loop's body for each loop (that could be even more expensive if
there are nested needs_max_vf_lowering loops).
And note_simd_array_uses has it all implemented: it builds a mapping between
each "omp simd array" and the simduid it belongs to.
So, drop the above
  calculate_dominance_info (CDI_DOMINATORS);
and, when seeing the first loop->needs_max_vf_lowering loop, call
note_simd_array_uses (just once per function).  For the simduid -> loop
mapping you actually only care which simduid corresponds to a
loop->needs_max_vf_lowering loop and which doesn't, so say a hash set indexed
by simduid would be good enough.
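Roughly like this (a sketch only, not the actual patch: it assumes
note_simd_array_uses and the simd_array_to_simduid hash table from
tree-vectorizer.{h,cc} are made usable from omp-offload.cc, and it uses a
bitmap of simduid DECL_UIDs in place of the hash set):

  static void
  adjust_max_vf (function *fun)
  {
    if (!fun->has_simduid_loops)
      return;

    poly_uint64 max_vf = omp_max_vf (false);
    uint64_t max_vf_int;
    if (!max_vf.is_constant (&max_vf_int))
      max_vf_int = constant_lower_bound (max_vf);

    hash_table<simd_array_to_simduid> *simd_arrays = NULL;
    auto_bitmap lower_simduids;

    for (auto loop : loops_list (fun, 0))
      if (loop->needs_max_vf_lowering && loop->simduid)
        {
          /* Walk the whole function for simd array uses just once, when
             the first such loop is seen.  Assumed exported from
             tree-vectorizer.cc with roughly this interface.  */
          if (!simd_arrays)
            note_simd_array_uses (&simd_arrays);
          bitmap_set_bit (lower_simduids, DECL_UID (loop->simduid));
          if (loop->safelen > max_vf_int)
            loop->safelen = max_vf_int;
        }

    /* Re-layout each simd array at most once, instead of once per use.  */
    if (simd_arrays)
      for (auto it = simd_arrays->begin (); it != simd_arrays->end (); ++it)
        if ((*it)->simduid != -1U
            && bitmap_bit_p (lower_simduids, (*it)->simduid))
          {
            tree decl = (*it)->decl;
            tree atype
              = build_array_type_nelts (TREE_TYPE (TREE_TYPE (decl)),
                                        max_vf_int);
            TREE_TYPE (decl) = atype;
            relayout_decl (decl);
          }
    delete simd_arrays;
  }

(If some of those arrays could already be shorter than max_vf_int, a
simduid -> safelen hash_map instead of the plain set would let this shrink
each array to the exact per-loop safelen rather than to max_vf_int.)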


> +      for (unsigned i = 0; i < loop->num_nodes; i++)
> +     for (auto gsi = gsi_start_bb (bbs[i]); !gsi_end_p (gsi); gsi_next (&gsi))
> +       {
> +         gcall *stmt = dyn_cast<gcall *> (gsi_stmt (gsi));
> +         tree lhs = NULL_TREE;
> +
> +         if (stmt
> +             && gimple_call_internal_p (stmt)
> +             && (gimple_call_internal_fn (stmt) == IFN_GOMP_SIMD_LANE)
> +             && ((lhs = gimple_call_lhs (stmt)) != NULL_TREE))
> +           {
> +             imm_use_iterator use_iter;
> +             gimple *use_stmt;
> +
> +             FOR_EACH_IMM_USE_STMT (use_stmt, use_iter, lhs)
> +               {
> +                 gassign *assign_stmt = dyn_cast<gassign *> (use_stmt);
> +                 if (!assign_stmt)
> +                   continue;
> +
> +                 tree assign_stmt_lhs = gimple_assign_lhs (assign_stmt);
> +                 if (TREE_CODE (assign_stmt_lhs) == ARRAY_REF)
> +                   {
> +                     tree array = TREE_OPERAND (assign_stmt_lhs, 0);
> +                     tree elt_type = TREE_TYPE (TREE_TYPE (array));
> +                     tree atype = build_array_type_nelts (elt_type,
> +                                                          loop->safelen);
> +                     TREE_TYPE (array) = atype;
> +                     relayout_decl (array);

Apart from the possibility that the .GOMP_SIMD_LANE call will end up outside
the loop body, this also has the disadvantage that if there are say 1000
accesses to some simd array, you change the VAR_DECL (new array type,
relayout_decl) 1000 times.
By just traversing the hash table, each decl that needs to be modified is
touched just once.

        Jakub
