On Tue, 15 Oct 2024, Tamar Christina wrote:

> Hi,
>
> Thanks for the look,
>
> The 10/15/2024 09:54, Richard Biener wrote:
> > On Mon, 14 Oct 2024, Tamar Christina wrote:
> >
> > > Hi All,
> > >
> > > This patch series adds support for a target to do a direct conversion
> > > for zero extends using permutes.
> > >
> > > To do this it uses a target hook use_permute_for_promotion which must
> > > be implemented by targets.  This hook is used to indicate:
> > >
> > > 1. can a target do this for the given modes.
> >
> > can_vec_perm_const_p?
> >
> > > 3. can the target convert between various vector modes with a
> > >    VIEW_CONVERT.
> >
> > We have modes_tieable_p for this I think.
>
> Yes, though the reason I didn't use either of them is that they report a
> capability of the backend, in which case the hook, which is already
> backend specific, should answer these two itself.
>
> I initially had these checks there, but they didn't seem to add value;
> for promotions the masks depend only on the input and output modes, so
> they really don't change.
>
> When you have, say, a loop that does lots of conversions from char to
> int, it seemed like a waste to retest the same permute constants over
> and over again.
>
> I can add them back in if you prefer...
>
> > > 2. is it profitable for the target to do it.
> >
> > So you say the target can do both ways, but both zip and tbl are
> > permute instructions, so I really fail to see the point and why
> > the target itself doesn't choose to use tbl for unpack.
> >
> > Is the intent in the end to have VEC_PERM in the IL rather than
> > VEC_UNPACK_* so it combines with other VEC_PERMs?
>
> Yes, and this happens quite often, e.g. load permutes or lane shuffles
> etc.  The reason for exposing them as VEC_PERM was to trigger further
> optimizations.
>
> If you remember the ticket about LOAD_LANES: with this optimization and
> an open encoding of LOAD_LANES we stop using it in cases where there's a
> zero extend after the LOAD_LANES, because then you're effectively doing
> two permutes and the LOAD_LANES is no longer beneficial.  There are
> other examples, load and replicate etc.
>
> > That said, I'm not against supporting VEC_PERM code gen from
> > unsigned promotion, but I don't see why we should do this when
> > the target advertises VEC_UNPACK_* support or direct conversion
> > support?
> >
> > Esp. with adding a "local" cost related hook which cannot take
> > into account context.
>
> To summarize a long story: yes, I open encode zero extends as permutes
> to allow further optimizations.  One could convert vec_unpacks to
> convert optabs and use that, but that is an opaque value that can't be
> further optimized.
>
> The hook isn't really a costing thing in the general sense.  It's
> literally just "do you want permutes, yes or no".  The reason it gets
> the modes is simply that I don't think a single level extend is worth
> it, but I can just change it to never try to do this on more than one
> level.
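For concreteness, the open-coded zero extend being discussed is a single
VEC_PERM_EXPR whose selector interleaves the input lanes with lanes of a
zero vector.  Below is a minimal, self-contained model of how such a
selector is built; the function name and shape are illustrative only, not
GCC code, and little-endian lane order is assumed:

  #include <cstdio>
  #include <vector>

  /* Build the permute selector that zero-extends NUNITS_IN unsigned
     BITS_IN-bit lanes to BITS_OUT-bit lanes.  The selector is expressed
     in BITS_IN-bit units; any index >= NUNITS_IN selects from the second
     permute operand, which is an all-zero vector, so every position above
     the value byte of a widened lane reads as zero.  */
  static std::vector<unsigned>
  zext_perm_sel (unsigned nunits_in, unsigned bits_in, unsigned bits_out)
  {
    unsigned steps = nunits_in;              /* selector length */
    unsigned step_size = bits_out / bits_in; /* input lanes per output lane */
    /* Flood fill with the first index into the zero operand.  */
    std::vector<unsigned> sel (steps, nunits_in);
    for (unsigned idx = 0; idx * step_size < steps; idx++)
      sel[idx * step_size] = idx;  /* low position of each widened lane */
    return sel;
  }

  int main ()
  {
    /* u8 -> u32 across one 16-byte vector.  */
    for (unsigned s : zext_perm_sel (16, 8, 32))
      std::printf ("%u ", s);
    std::printf ("\n");
    return 0;
  }

This prints "0 16 16 16 1 16 16 16 2 16 16 16 3 16 16 16", the mask for the
first output vector; the remaining chunks use the same shape with rotated
value indices (4..7, 8..11, 12..15).  Each such mask corresponds to one of
the TBL masks in the AArch64 example quoted further down.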
When you mention LOAD_LANES we do not expose "permutes" in them on GIMPLE
either, so why should we for VEC_UNPACK_*?  At what level do the
simplifications you see happen, then?

I do realize we have two ways of expressing zero-extending widenings (also
truncations, btw) and that's always bad - so we could decide to _always_
use VEC_PERMs as the canonical representation because those combine more
easily, and either match VEC_PERMs back to vec_unpack at RTL expansion
time or require targets to expose those as constant vec_perms as well.

There are targets like GCN where you can't do unpacking with permutes of
course, so we can't do away with them (we could possibly force those
targets to expose widening/truncation solely with [us]ext and trunc
patterns of course).

> I think there's a lot of merit in open-encoding zero extends, but one
> reason this is beneficial on AArch64 for instance is that we can consume
> the zero register and rewrite the indices to a single register TBL.
> Two-register TBLs are slower on some implementations.

But this latter fact can be handled by optimizing the RTL?

Richard.

> Thanks,
> Tamar
>
> > > Using permutations has a big benefit for multi-step zero extensions
> > > because it both reduces the number of needed instructions and
> > > increases throughput, as the dependency chain is removed.
> > >
> > > Concretely on AArch64 this changes:
> > >
> > > void test4(unsigned char *x, long long *y, int n) {
> > >     for(int i = 0; i < n; i++) {
> > >         y[i] = x[i];
> > >     }
> > > }
> > >
> > > from generating:
> > >
> > > .L4:
> > >         ldr     q30, [x4], 16
> > >         add     x3, x3, 128
> > >         zip1    v1.16b, v30.16b, v31.16b
> > >         zip2    v30.16b, v30.16b, v31.16b
> > >         zip1    v2.8h, v1.8h, v31.8h
> > >         zip1    v0.8h, v30.8h, v31.8h
> > >         zip2    v1.8h, v1.8h, v31.8h
> > >         zip2    v30.8h, v30.8h, v31.8h
> > >         zip1    v26.4s, v2.4s, v31.4s
> > >         zip1    v29.4s, v0.4s, v31.4s
> > >         zip1    v28.4s, v1.4s, v31.4s
> > >         zip1    v27.4s, v30.4s, v31.4s
> > >         zip2    v2.4s, v2.4s, v31.4s
> > >         zip2    v0.4s, v0.4s, v31.4s
> > >         zip2    v1.4s, v1.4s, v31.4s
> > >         zip2    v30.4s, v30.4s, v31.4s
> > >         stp     q26, q2, [x3, -128]
> > >         stp     q28, q1, [x3, -96]
> > >         stp     q29, q0, [x3, -64]
> > >         stp     q27, q30, [x3, -32]
> > >         cmp     x4, x5
> > >         bne     .L4
> > >
> > > and instead we get:
> > >
> > > .L4:
> > >         add     x3, x3, 128
> > >         ldr     q23, [x4], 16
> > >         tbl     v5.16b, {v23.16b}, v31.16b
> > >         tbl     v4.16b, {v23.16b}, v30.16b
> > >         tbl     v3.16b, {v23.16b}, v29.16b
> > >         tbl     v2.16b, {v23.16b}, v28.16b
> > >         tbl     v1.16b, {v23.16b}, v27.16b
> > >         tbl     v0.16b, {v23.16b}, v26.16b
> > >         tbl     v22.16b, {v23.16b}, v25.16b
> > >         tbl     v23.16b, {v23.16b}, v24.16b
> > >         stp     q5, q4, [x3, -128]
> > >         stp     q3, q2, [x3, -96]
> > >         stp     q1, q0, [x3, -64]
> > >         stp     q22, q23, [x3, -32]
> > >         cmp     x4, x5
> > >         bne     .L4
> > >
> > > Tests are added in the AArch64 patch introducing the hook.  The
> > > testsuite also already had about 800 runtime tests that are affected
> > > by this.
> > >
> > > Bootstrapped and regtested on aarch64-none-linux-gnu,
> > > arm-none-linux-gnueabihf, and x86_64-pc-linux-gnu (-m32 and -m64)
> > > with no issues.
> > >
> > > Ok for master?
> > >
> > > Thanks,
> > > Tamar
> > >
> > > gcc/ChangeLog:
> > >
> > >     * target.def (use_permute_for_promotion): New.
> > >     * doc/tm.texi.in: Document it.
> > >     * doc/tm.texi: Regenerate.
> > >     * targhooks.cc (default_use_permute_for_promotion): New.
> > >     * targhooks.h (default_use_permute_for_promotion): New.
> > >     * tree-vect-stmts.cc (vectorizable_conversion): Support direct
> > >     conversion with permute.
> > >     (vect_create_vectorized_promotion_stmts): Likewise.
> > >     (supportable_widening_operation): Likewise.
> > >     (vect_gen_perm_mask_any): Allow vector permutes where input
> > >     registers are half the width of the result per the GCC 14
> > >     relaxation of VEC_PERM_EXPR.
> > >
> > > ---
> > > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> > > index 4deb3d2c283a2964972b94f434370a6f57ea816a..e8192590ac14005bf7cb5f731c16ee7eacb78143 100644
> > > --- a/gcc/doc/tm.texi
> > > +++ b/gcc/doc/tm.texi
> > > @@ -6480,6 +6480,15 @@ type @code{internal_fn}) should be considered expensive when the mask is
> > >  all zeros.  GCC can then try to branch around the instruction instead.
> > >  @end deftypefn
> > >
> > > +@deftypefn {Target Hook} bool TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION (const_tree @var{in_type}, const_tree @var{out_type})
> > > +This hook returns true if the operation promoting @var{in_type} to
> > > +@var{out_type} should be done as a vector permute.  If @var{out_type} is
> > > +a signed type the operation will be done as the related unsigned type and
> > > +converted to @var{out_type}.  If the target supports the needed permute,
> > > +is able to convert unsigned(@var{out_type}) to @var{out_type}, and it is
> > > +beneficial to do so, the hook should return true, else false.
> > > +@end deftypefn
> > > +
> > >  @deftypefn {Target Hook} {class vector_costs *} TARGET_VECTORIZE_CREATE_COSTS (vec_info *@var{vinfo}, bool @var{costing_for_scalar})
> > >  This hook should initialize target-specific data structures in preparation
> > >  for modeling the costs of vectorizing a loop or basic block.  The default
> > > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> > > index 9f147ccb95cc6d4e79cdf5b265666ad502492145..c007bc707372dd374e8effc52d29b76f5bc283a1 100644
> > > --- a/gcc/doc/tm.texi.in
> > > +++ b/gcc/doc/tm.texi.in
> > > @@ -4303,6 +4303,8 @@ address;  but often a machine-dependent strategy can generate better code.
> > >
> > >  @hook TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE
> > >
> > > +@hook TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION
> > > +
> > >  @hook TARGET_VECTORIZE_CREATE_COSTS
> > >
> > >  @hook TARGET_VECTORIZE_BUILTIN_GATHER
> > > diff --git a/gcc/target.def b/gcc/target.def
> > > index b31550108883c5c3f5ffc7e46a1e8a7b839ebe83..58545d5ef4248da5850edec8f4db9f2636973598 100644
> > > --- a/gcc/target.def
> > > +++ b/gcc/target.def
> > > @@ -2056,6 +2056,20 @@ all zeros.  GCC can then try to branch around the instruction instead.",
> > >   (unsigned ifn),
> > >   default_empty_mask_is_expensive)
> > >
> > > +/* Function to say whether a target supports and prefers to use permutes
> > > +   for zero extensions or truncates.  */
> > > +DEFHOOK
> > > +(use_permute_for_promotion,
> > > + "This hook returns true if the operation promoting @var{in_type} to\n\
> > > +@var{out_type} should be done as a vector permute.  If @var{out_type} is\n\
> > > +a signed type the operation will be done as the related unsigned type and\n\
> > > +converted to @var{out_type}.  If the target supports the needed permute,\n\
> > > +is able to convert unsigned(@var{out_type}) to @var{out_type}, and it is\n\
> > > +beneficial to do so, the hook should return true, else false.",
> > > + bool,
> > > + (const_tree in_type, const_tree out_type),
> > > + default_use_permute_for_promotion)
> > > +
> > >  /* Target builtin that implements vector gather operation.  */
> > >  DEFHOOK
> > >  (builtin_gather,
> > > diff --git a/gcc/targhooks.h b/gcc/targhooks.h
> > > index 2704d6008f14d2aa65671f002af886d3b802effa..723f8f4fda7808b6899f10f8b3fafad74d3c536f 100644
> > > --- a/gcc/targhooks.h
> > > +++ b/gcc/targhooks.h
> > > @@ -124,6 +124,7 @@ extern opt_machine_mode default_vectorize_related_mode (machine_mode,
> > >  extern opt_machine_mode default_get_mask_mode (machine_mode);
> > >  extern bool default_empty_mask_is_expensive (unsigned);
> > >  extern bool default_conditional_operation_is_expensive (unsigned);
> > > +extern bool default_use_permute_for_promotion (const_tree, const_tree);
> > >  extern vector_costs *default_vectorize_create_costs (vec_info *, bool);
> > >
> > >  /* OpenACC hooks.  */
> > > diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
> > > index dc040df9fcd1182b62d83088ee7fb3a248c99f51..a487eab794fe9f1089ecb58fdfc881fdb19d28f3 100644
> > > --- a/gcc/targhooks.cc
> > > +++ b/gcc/targhooks.cc
> > > @@ -1615,6 +1615,14 @@ default_conditional_operation_is_expensive (unsigned ifn)
> > >    return ifn == IFN_MASK_STORE;
> > >  }
> > >
> > > +/* By default no targets prefer permutes over multi step extension.  */
> > > +
> > > +bool
> > > +default_use_permute_for_promotion (const_tree, const_tree)
> > > +{
> > > +  return false;
> > > +}
> > > +
> > >  /* By default consider masked stores to be expensive.  */
> > >
> > >  bool
> > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > > index 4f6905f15417f90c6f36e1711a7a25071f0f507c..f2939655e4ec34111baa8894eaf769d29b1c5b82 100644
> > > --- a/gcc/tree-vect-stmts.cc
> > > +++ b/gcc/tree-vect-stmts.cc
> > > @@ -5129,6 +5129,111 @@ vect_create_vectorized_promotion_stmts (vec_info *vinfo,
> > >    gimple *new_stmt1, *new_stmt2;
> > >    vec<tree> vec_tmp = vNULL;
> > >
> > > +  /* If we're using a VEC_PERM_EXPR then we're widening to the final
> > > +     type in one go.  */
> > > +  if (ch1 == VEC_PERM_EXPR
> > > +      && op_type == unary_op)
> > > +    {
> > > +      vec_tmp.create (vec_oprnds0->length () * 2);
> > > +      bool failed_p = false;
> > > +
> > > +      /* Extending with a vec-perm requires 2 instructions per step.  */
> > > +      FOR_EACH_VEC_ELT (*vec_oprnds0, i, vop0)
> > > +        {
> > > +          tree vectype_in = TREE_TYPE (vop0);
> > > +          tree vectype_out = TREE_TYPE (vec_dest);
> > > +          machine_mode mode_in = TYPE_MODE (vectype_in);
> > > +          machine_mode mode_out = TYPE_MODE (vectype_out);
> > > +          unsigned bitsize_in = element_precision (vectype_in);
> > > +          unsigned tot_in, tot_out;
> > > +          unsigned HOST_WIDE_INT count;
> > > +
> > > +          /* We can't really support VLA here as the indexes depend on
> > > +             the VL.  VLA should really use widening instructions like
> > > +             widening loads.  */
> > > +          if (!GET_MODE_BITSIZE (mode_in).is_constant (&tot_in)
> > > +              || !GET_MODE_BITSIZE (mode_out).is_constant (&tot_out)
> > > +              || !TYPE_VECTOR_SUBPARTS (vectype_in).is_constant (&count)
> > > +              || !TYPE_UNSIGNED (vectype_in)
> > > +              || !targetm.vectorize.use_permute_for_promotion (vectype_in,
> > > +                                                               vectype_out))
> > > +            {
> > > +              failed_p = true;
> > > +              break;
> > > +            }
> > > +
> > > +          unsigned steps = tot_out / bitsize_in;
> > > +          tree zero = build_zero_cst (vectype_in);
> > > +
> > > +          unsigned chunk_size
> > > +            = exact_div (TYPE_VECTOR_SUBPARTS (vectype_in),
> > > +                         TYPE_VECTOR_SUBPARTS (vectype_out)).to_constant ();
> > > +          unsigned step_size = chunk_size * (tot_out / tot_in);
> > > +          unsigned nunits = tot_out / bitsize_in;
> > > +
> > > +          vec_perm_builder sel (steps, 1, 1);
> > > +          sel.quick_grow (steps);
> > > +
> > > +          /* Flood fill with the out of range value first.  */
> > > +          for (unsigned long i = 0; i < steps; ++i)
> > > +            sel[i] = count;
> > > +
> > > +          tree var;
> > > +          tree elem_in = TREE_TYPE (vectype_in);
> > > +          machine_mode elem_mode_in = TYPE_MODE (elem_in);
> > > +          unsigned long idx = 0;
> > > +          tree vc_in = get_related_vectype_for_scalar_type (elem_mode_in,
> > > +                                                            elem_in, nunits);
> > > +
> > > +          for (unsigned long j = 0; j < chunk_size; j++)
> > > +            {
> > > +              if (WORDS_BIG_ENDIAN)
> > > +                for (int i = steps - 1; i >= 0; i -= step_size, idx++)
> > > +                  sel[i] = idx;
> > > +              else
> > > +                for (int i = 0; i < (int)steps; i += step_size, idx++)
> > > +                  sel[i] = idx;
> > > +
> > > +              vec_perm_indices indices (sel, 2, steps);
> > > +
> > > +              tree perm_mask = vect_gen_perm_mask_checked (vc_in, indices);
> > > +              auto vec_oprnd = make_ssa_name (vc_in);
> > > +              auto new_stmt = gimple_build_assign (vec_oprnd, VEC_PERM_EXPR,
> > > +                                                   vop0, zero, perm_mask);
> > > +              vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
> > > +
> > > +              tree intvect_out = unsigned_type_for (vectype_out);
> > > +              var = make_ssa_name (intvect_out);
> > > +              new_stmt = gimple_build_assign (var, build1 (VIEW_CONVERT_EXPR,
> > > +                                                           intvect_out,
> > > +                                                           vec_oprnd));
> > > +              vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
> > > +
> > > +              gcc_assert (ch2.is_tree_code ());
> > > +
> > > +              var = make_ssa_name (vectype_out);
> > > +              if (ch2 == VIEW_CONVERT_EXPR)
> > > +                new_stmt = gimple_build_assign (var,
> > > +                                                build1 (VIEW_CONVERT_EXPR,
> > > +                                                        vectype_out,
> > > +                                                        vec_oprnd));
> > > +              else
> > > +                new_stmt = gimple_build_assign (var, (tree_code)ch2,
> > > +                                                vec_oprnd);
> > > +
> > > +              vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
> > > +              vec_tmp.safe_push (var);
> > > +            }
> > > +        }
> > > +
> > > +      if (!failed_p)
> > > +        {
> > > +          vec_oprnds0->release ();
> > > +          *vec_oprnds0 = vec_tmp;
> > > +          return;
> > > +        }
> > > +    }
> > > +
> > >    vec_tmp.create (vec_oprnds0->length () * 2);
> > >    FOR_EACH_VEC_ELT (*vec_oprnds0, i, vop0)
> > >      {
> > > @@ -5495,6 +5600,20 @@ vectorizable_conversion (vec_info *vinfo,
> > >            || GET_MODE_SIZE (lhs_mode) <= GET_MODE_SIZE (rhs_mode))
> > >          goto unsupported;
> > >
> > > +      /* Check to see if the target can use a permute to perform the
> > > +         zero extension.  */
> > > +      intermediate_type = unsigned_type_for (vectype_out);
> > > +      if (TYPE_UNSIGNED (vectype_in)
> > > +          && VECTOR_TYPE_P (intermediate_type)
> > > +          && TYPE_VECTOR_SUBPARTS (intermediate_type).is_constant ()
> > > +          && targetm.vectorize.use_permute_for_promotion (vectype_in,
> > > +                                                          intermediate_type))
> > > +        {
> > > +          code1 = VEC_PERM_EXPR;
> > > +          code2 = FLOAT_EXPR;
> > > +          break;
> > > +        }
> > > +
> > >        fltsz = GET_MODE_SIZE (lhs_mode);
> > >        FOR_EACH_2XWIDER_MODE (rhs_mode_iter, rhs_mode)
> > >          {
> > > @@ -9804,7 +9923,8 @@ vect_gen_perm_mask_any (tree vectype, const vec_perm_indices &sel)
> > >    tree mask_type;
> > >
> > >    poly_uint64 nunits = sel.length ();
> > > -  gcc_assert (known_eq (nunits, TYPE_VECTOR_SUBPARTS (vectype)));
> > > +  gcc_assert (known_eq (nunits, TYPE_VECTOR_SUBPARTS (vectype))
> > > +              || known_eq (nunits, TYPE_VECTOR_SUBPARTS (vectype) * 2));
> > >
> > >    mask_type = build_vector_type (ssizetype, nunits);
> > >    return vec_perm_indices_to_tree (mask_type, sel);
> > > @@ -14397,8 +14517,20 @@ supportable_widening_operation (vec_info *vinfo,
> > >        break;
> > >
> > >      CASE_CONVERT:
> > > -      c1 = VEC_UNPACK_LO_EXPR;
> > > -      c2 = VEC_UNPACK_HI_EXPR;
> > > +      {
> > > +        tree cvt_type = unsigned_type_for (vectype_out);
> > > +        if (TYPE_UNSIGNED (vectype_in)
> > > +            && VECTOR_TYPE_P (cvt_type)
> > > +            && TYPE_VECTOR_SUBPARTS (cvt_type).is_constant ()
> > > +            && targetm.vectorize.use_permute_for_promotion (vectype_in,
> > > +                                                            cvt_type))
> > > +          {
> > > +            *code1 = VEC_PERM_EXPR;
> > > +            *code2 = VIEW_CONVERT_EXPR;
> > > +            return true;
> > > +          }
> > > +        c1 = VEC_UNPACK_LO_EXPR;
> > > +        c2 = VEC_UNPACK_HI_EXPR;
> > > +      }
> > >        break;
> > >
> > >      case FLOAT_EXPR:
> >
> > --
> > Richard Biener <rguent...@suse.de>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

--
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)