> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: Monday, October 14, 2024 7:34 PM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; rguent...@suse.de
> Subject: Re: [PATCH 1/4]middle-end: support multi-step zero-extends using
> VEC_PERM_EXPR
> 
> Tamar Christina <tamar.christ...@arm.com> writes:
> > Hi All,
> >
> > This patch series adds support for a target to do a direct conversion for
> > zero extends using permutes.
> >
> > To do this it uses a target hook use_permute_for_promotion which must be
> > implemented by targets.  This hook is used to indicate (a rough sketch of
> > such a hook follows the list below):
> >
> >  1. can a target do this for the given modes.
> >  2. is it profitable for the target to do it.
> >  3. can the target convert between various vector modes with a VIEW_CONVERT.
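> >
> > As a rough illustration only (this is not the aarch64 implementation from
> > the later patch in this series; the function name and the 128-bit check
> > are made up), such a hook essentially answers those three questions:
> >
> >   static bool
> >   example_use_permute_for_promotion (const_tree in_type, const_tree out_type)
> >   {
> >     /* 1. Only fixed-width vectors, widening to a strictly wider element.  */
> >     if (!VECTOR_TYPE_P (in_type)
> >         || !VECTOR_TYPE_P (out_type)
> >         || !TYPE_VECTOR_SUBPARTS (in_type).is_constant ()
> >         || !TYPE_VECTOR_SUBPARTS (out_type).is_constant ()
> >         || element_precision (out_type) <= element_precision (in_type))
> >       return false;
> >
> >     /* 2. and 3. Claim the permute is profitable and that the result can
> >        be reinterpreted, whenever both types fit a 128-bit register.  */
> >     return known_eq (GET_MODE_SIZE (TYPE_MODE (in_type)), 16)
> >            && known_eq (GET_MODE_SIZE (TYPE_MODE (out_type)), 16);
> >   }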
> >
> > Using permutations has a big benefit for multi-step zero extensions because
> > it both reduces the number of needed instructions and increases throughput,
> > as the dependency chain is removed.
> >
> > Concretely on AArch64 this changes:
> >
> > void test4(unsigned char *x, long long *y, int n) {
> >     for(int i = 0; i < n; i++) {
> >         y[i] = x[i];
> >     }
> > }
> >
> > from generating:
> >
> > .L4:
> >         ldr     q30, [x4], 16
> >         add     x3, x3, 128
> >         zip1    v1.16b, v30.16b, v31.16b
> >         zip2    v30.16b, v30.16b, v31.16b
> >         zip1    v2.8h, v1.8h, v31.8h
> >         zip1    v0.8h, v30.8h, v31.8h
> >         zip2    v1.8h, v1.8h, v31.8h
> >         zip2    v30.8h, v30.8h, v31.8h
> >         zip1    v26.4s, v2.4s, v31.4s
> >         zip1    v29.4s, v0.4s, v31.4s
> >         zip1    v28.4s, v1.4s, v31.4s
> >         zip1    v27.4s, v30.4s, v31.4s
> >         zip2    v2.4s, v2.4s, v31.4s
> >         zip2    v0.4s, v0.4s, v31.4s
> >         zip2    v1.4s, v1.4s, v31.4s
> >         zip2    v30.4s, v30.4s, v31.4s
> >         stp     q26, q2, [x3, -128]
> >         stp     q28, q1, [x3, -96]
> >         stp     q29, q0, [x3, -64]
> >         stp     q27, q30, [x3, -32]
> >         cmp     x4, x5
> >         bne     .L4
> >
> > and instead we get:
> >
> > .L4:
> >         add     x3, x3, 128
> >         ldr     q23, [x4], 16
> >         tbl     v5.16b, {v23.16b}, v31.16b
> >         tbl     v4.16b, {v23.16b}, v30.16b
> >         tbl     v3.16b, {v23.16b}, v29.16b
> >         tbl     v2.16b, {v23.16b}, v28.16b
> >         tbl     v1.16b, {v23.16b}, v27.16b
> >         tbl     v0.16b, {v23.16b}, v26.16b
> >         tbl     v22.16b, {v23.16b}, v25.16b
> >         tbl     v23.16b, {v23.16b}, v24.16b
> >         stp     q5, q4, [x3, -128]
> >         stp     q3, q2, [x3, -96]
> >         stp     q1, q0, [x3, -64]
> >         stp     q22, q23, [x3, -32]
> >         cmp     x4, x5
> >         bne     .L4
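> >
> > The tbl form works because an out-of-range index selects zero, so each
> > permute places one source byte per output element and zero-fills the rest.
> > As a standalone sketch (little-endian, GNU vector extensions, illustrative
> > only; the real loop emits eight such permutes), the first output vector is
> > roughly equivalent to:
> >
> >   typedef unsigned char v16qi __attribute__ ((vector_size (16)));
> >   typedef unsigned long long v2udi __attribute__ ((vector_size (16)));
> >
> >   v2udi
> >   zext_low_two (v16qi x)
> >   {
> >     v16qi zeros = { 0 };
> >     /* Indices 16..31 pick from the second (all-zero) operand.  */
> >     v16qi sel = { 0, 16, 16, 16, 16, 16, 16, 16,
> >                   1, 16, 16, 16, 16, 16, 16, 16 };
> >     v16qi perm = __builtin_shuffle (x, zeros, sel);
> >     return (v2udi) perm;   /* plain reinterpretation, no further extension  */
> >   }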
> >
> > Tests are added in the AArch64 patch introducing the hook.  The testsuite
> > also already has about 800 runtime tests that are affected by this.
> >
> > Bootstrapped and regtested on aarch64-none-linux-gnu, arm-none-linux-gnueabihf,
> > x86_64-pc-linux-gnu -m32, -m64 with no issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> >     * target.def (use_permute_for_promotion): New.
> >     * doc/tm.texi.in: Document it.
> >     * doc/tm.texi: Regenerate.
> >     * targhooks.cc (default_use_permute_for_promotion): New.
> >     * targhooks.h (default_use_permute_for_promotion): New.
> >     * tree-vect-stmts.cc (vectorizable_conversion): Support direct
> >     conversion with permute.
> >     (vect_create_vectorized_promotion_stmts): Likewise.
> >     (supportable_widening_operation): Likewise.
> >     (vect_gen_perm_mask_any): Allow vector permutes where input registers
> >     are half the width of the result per the GCC 14 relaxation of
> >     VEC_PERM_EXPR.
> >
> > ---
> >
> > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> > index 4deb3d2c283a2964972b94f434370a6f57ea816a..e8192590ac14005bf7cb5f731c16ee7eacb78143 100644
> > --- a/gcc/doc/tm.texi
> > +++ b/gcc/doc/tm.texi
> > @@ -6480,6 +6480,15 @@ type @code{internal_fn}) should be considered expensive when the mask is
> >  all zeros.  GCC can then try to branch around the instruction instead.
> >  @end deftypefn
> >
> > +@deftypefn {Target Hook} bool TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION (const_tree @var{in_type}, const_tree @var{out_type})
> > +This hook returns true if the operation promoting @var{in_type} to
> > +@var{out_type} should be done as a vector permute.  If @var{out_type} is
> > +a signed type the operation will be done as the related unsigned type and
> > +converted to @var{out_type}.  If the target supports the needed permute,
> > +is able to convert unsigned(@var{out_type}) to @var{out_type} and it is
> > +beneficial to do so, the hook should return true, otherwise false.
> > +@end deftypefn
> 
> Just a review of the documentation, but: is a two-step process really
> necessary for signed out_types?  I thought it could be done directly,
> since it's in_type rather than out_type that determines the type of
> extension.

Thanks!

I think this is an indication the text is ambiguous.  The intention was to say
that if out_type is signed, we still keep the type as signed, but insert an
intermediate cast to (unsigned type(out_type)).

The optimization only looks at in_type as you correctly point out.
I think you're right in that the documentation is explaining too much of
how the optimization does the transform, rather than explaining just
the transform.
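
As a concrete sketch with GNU vector extension casts (illustrative only, not
the exact gimple the patch builds), the signed case just reinterprets the
unsigned permute result, first as unsigned(out_type), then as out_type:

  typedef unsigned char v16qi __attribute__ ((vector_size (16)));
  typedef unsigned long long v2udi __attribute__ ((vector_size (16)));
  typedef long long v2di __attribute__ ((vector_size (16)));

  v2di
  as_signed (v16qi perm_result)
  {
    v2udi utmp = (v2udi) perm_result;  /* unsigned type of out_type  */
    return (v2di) utmp;                /* then out_type itself  */
  }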

Would it be clearer if I just deleted the

> If @var{out_type} is
> > +a signed type the operation will be done as the related unsigned type and
> > +converted to @var{out_type}.  

Part?

Thanks for raising this :)

Thanks,
Tamar
> 
> Thanks,
> Richard
> 
> > +
> >  @deftypefn {Target Hook} {class vector_costs *} TARGET_VECTORIZE_CREATE_COSTS (vec_info *@var{vinfo}, bool @var{costing_for_scalar})
> >  This hook should initialize target-specific data structures in preparation
> >  for modeling the costs of vectorizing a loop or basic block.  The default
> > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> > index 9f147ccb95cc6d4e79cdf5b265666ad502492145..c007bc707372dd374e8effc52d29b76f5bc283a1 100644
> > --- a/gcc/doc/tm.texi.in
> > +++ b/gcc/doc/tm.texi.in
> > @@ -4303,6 +4303,8 @@ address;  but often a machine-dependent strategy can generate better code.
> >
> >  @hook TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE
> >
> > +@hook TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION
> > +
> >  @hook TARGET_VECTORIZE_CREATE_COSTS
> >
> >  @hook TARGET_VECTORIZE_BUILTIN_GATHER
> > diff --git a/gcc/target.def b/gcc/target.def
> > index b31550108883c5c3f5ffc7e46a1e8a7b839ebe83..58545d5ef4248da5850edec8f4db9f2636973598 100644
> > --- a/gcc/target.def
> > +++ b/gcc/target.def
> > @@ -2056,6 +2056,20 @@ all zeros.  GCC can then try to branch around the instruction instead.",
> >   (unsigned ifn),
> >   default_empty_mask_is_expensive)
> >
> > +/* Function to say whether a target supports and prefers to use permutes for
> > +   zero extensions or truncates.  */
> > +DEFHOOK
> > +(use_permute_for_promotion,
> > + "This hook returns true if the operation promoting @var{in_type} to\n\
> > +@var{out_type} should be done as a vector permute.  If @var{out_type} is\n\
> > +a signed type the operation will be done as the related unsigned type 
> > and\n\
> > +converted to @var{out_type}.  If the target supports the needed permute,\n\
> > +is able to convert unsigned(@var{out_type}) to @var{out_type} and it is\n\
> > +beneficial to the hook should return true, else false should be returned.",
> > + bool,
> > + (const_tree in_type, const_tree out_type),
> > + default_use_permute_for_promotion)
> > +
> >  /* Target builtin that implements vector gather operation.  */
> >  DEFHOOK
> >  (builtin_gather,
> > diff --git a/gcc/targhooks.h b/gcc/targhooks.h
> > index 2704d6008f14d2aa65671f002af886d3b802effa..723f8f4fda7808b6899f10f8b3fafad74d3c536f 100644
> > --- a/gcc/targhooks.h
> > +++ b/gcc/targhooks.h
> > @@ -124,6 +124,7 @@ extern opt_machine_mode default_vectorize_related_mode (machine_mode,
> >  extern opt_machine_mode default_get_mask_mode (machine_mode);
> >  extern bool default_empty_mask_is_expensive (unsigned);
> >  extern bool default_conditional_operation_is_expensive (unsigned);
> > +extern bool default_use_permute_for_promotion (const_tree, const_tree);
> >  extern vector_costs *default_vectorize_create_costs (vec_info *, bool);
> >
> >  /* OpenACC hooks.  */
> > diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
> > index dc040df9fcd1182b62d83088ee7fb3a248c99f51..a487eab794fe9f1089ecb58fdfc881fdb19d28f3 100644
> > --- a/gcc/targhooks.cc
> > +++ b/gcc/targhooks.cc
> > @@ -1615,6 +1615,14 @@ default_conditional_operation_is_expensive (unsigned ifn)
> >    return ifn == IFN_MASK_STORE;
> >  }
> >
> > +/* By default no targets prefer permutes over a multi-step extension.  */
> > +
> > +bool
> > +default_use_permute_for_promotion (const_tree, const_tree)
> > +{
> > +  return false;
> > +}
> > +
> >  /* By default consider masked stores to be expensive.  */
> >
> >  bool
> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > index 4f6905f15417f90c6f36e1711a7a25071f0f507c..f2939655e4ec34111baa8894eaf769d29b1c5b82 100644
> > --- a/gcc/tree-vect-stmts.cc
> > +++ b/gcc/tree-vect-stmts.cc
> > @@ -5129,6 +5129,111 @@ vect_create_vectorized_promotion_stmts (vec_info *vinfo,
> >    gimple *new_stmt1, *new_stmt2;
> >    vec<tree> vec_tmp = vNULL;
> >
> > +  /* If we're using a VEC_PERM_EXPR then we're widening to the final type in
> > +     one go.  */
> > +  if (ch1 == VEC_PERM_EXPR
> > +      && op_type == unary_op)
> > +    {
> > +      vec_tmp.create (vec_oprnds0->length () * 2);
> > +      bool failed_p = false;
> > +
> > +      /* Extending with a vec-perm requires 2 instructions per step.  */
> > +      FOR_EACH_VEC_ELT (*vec_oprnds0, i, vop0)
> > +   {
> > +     tree vectype_in = TREE_TYPE (vop0);
> > +     tree vectype_out = TREE_TYPE (vec_dest);
> > +     machine_mode mode_in = TYPE_MODE (vectype_in);
> > +     machine_mode mode_out = TYPE_MODE (vectype_out);
> > +     unsigned bitsize_in = element_precision (vectype_in);
> > +     unsigned tot_in, tot_out;
> > +     unsigned HOST_WIDE_INT count;
> > +
> > +     /* We can't really support VLA here as the indexes depend on the VL.
> > +        VLA should really use widening instructions like widening
> > +        loads.  */
> > +     if (!GET_MODE_BITSIZE (mode_in).is_constant (&tot_in)
> > +         || !GET_MODE_BITSIZE (mode_out).is_constant (&tot_out)
> > +         || !TYPE_VECTOR_SUBPARTS (vectype_in).is_constant (&count)
> > +         || !TYPE_UNSIGNED (vectype_in)
> > +         || !targetm.vectorize.use_permute_for_promotion (vectype_in,
> > +                                                          vectype_out))
> > +       {
> > +         failed_p = true;
> > +         break;
> > +       }
> > +
> > +     unsigned steps = tot_out / bitsize_in;
> > +     tree zero = build_zero_cst (vectype_in);
> > +
> > +     unsigned chunk_size
> > +       = exact_div (TYPE_VECTOR_SUBPARTS (vectype_in),
> > +                    TYPE_VECTOR_SUBPARTS (vectype_out)).to_constant ();
> > +     unsigned step_size = chunk_size * (tot_out / tot_in);
> > +     unsigned nunits = tot_out / bitsize_in;
> > +
> > +     vec_perm_builder sel (steps, 1, 1);
> > +     sel.quick_grow (steps);
> > +
> > +     /* Flood fill with the out of range value first.  */
> > +     for (unsigned long i = 0; i < steps; ++i)
> > +       sel[i] = count;
> > +
> > +     tree var;
> > +     tree elem_in = TREE_TYPE (vectype_in);
> > +     machine_mode elem_mode_in = TYPE_MODE (elem_in);
> > +     unsigned long idx = 0;
> > +     tree vc_in = get_related_vectype_for_scalar_type (elem_mode_in,
> > +                                                       elem_in, nunits);
> > +
> > +     for (unsigned long j = 0; j < chunk_size; j++)
> > +       {
> > +         if (WORDS_BIG_ENDIAN)
> > +           for (int i = steps - 1; i >= 0; i -= step_size, idx++)
> > +             sel[i] = idx;
> > +         else
> > +           for (int i = 0; i < (int)steps; i += step_size, idx++)
> > +             sel[i] = idx;
> > +
> > +         vec_perm_indices indices (sel, 2, steps);
> > +
> > +         tree perm_mask = vect_gen_perm_mask_checked (vc_in, indices);
> > +         auto vec_oprnd = make_ssa_name (vc_in);
> > +         auto new_stmt = gimple_build_assign (vec_oprnd, VEC_PERM_EXPR,
> > +                                              vop0, zero, perm_mask);
> > +         vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
> > +
> > +         tree intvect_out = unsigned_type_for (vectype_out);
> > +         var = make_ssa_name (intvect_out);
> > +         new_stmt = gimple_build_assign (var, build1 (VIEW_CONVERT_EXPR,
> > +                                                      intvect_out,
> > +                                                      vec_oprnd));
> > +         vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
> > +
> > +         gcc_assert (ch2.is_tree_code ());
> > +
> > +         var = make_ssa_name (vectype_out);
> > +         if (ch2 == VIEW_CONVERT_EXPR)
> > +             new_stmt = gimple_build_assign (var,
> > +                                             build1 (VIEW_CONVERT_EXPR,
> > +                                                     vectype_out,
> > +                                                     vec_oprnd));
> > +         else
> > +             new_stmt = gimple_build_assign (var, (tree_code)ch2,
> > +                                             vec_oprnd);
> > +
> > +         vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
> > +         vec_tmp.safe_push (var);
> > +       }
> > +   }
> > +
> > +      if (!failed_p)
> > +   {
> > +     vec_oprnds0->release ();
> > +     *vec_oprnds0 = vec_tmp;
> > +     return;
> > +   }
> > +    }
> > +
> >    vec_tmp.create (vec_oprnds0->length () * 2);
> >    FOR_EACH_VEC_ELT (*vec_oprnds0, i, vop0)
> >      {
> > @@ -5495,6 +5600,20 @@ vectorizable_conversion (vec_info *vinfo,
> >       || GET_MODE_SIZE (lhs_mode) <= GET_MODE_SIZE (rhs_mode))
> >     goto unsupported;
> >
> > +      /* Check to see if the target can use a permute to perform the zero
> > +    extension.  */
> > +      intermediate_type = unsigned_type_for (vectype_out);
> > +      if (TYPE_UNSIGNED (vectype_in)
> > +     && VECTOR_TYPE_P (intermediate_type)
> > +     && TYPE_VECTOR_SUBPARTS (intermediate_type).is_constant ()
> > +     && targetm.vectorize.use_permute_for_promotion (vectype_in,
> > +                                                     intermediate_type))
> > +   {
> > +     code1 = VEC_PERM_EXPR;
> > +     code2 = FLOAT_EXPR;
> > +     break;
> > +   }
> > +
> >        fltsz = GET_MODE_SIZE (lhs_mode);
> >        FOR_EACH_2XWIDER_MODE (rhs_mode_iter, rhs_mode)
> >     {
> > @@ -9804,7 +9923,8 @@ vect_gen_perm_mask_any (tree vectype, const vec_perm_indices &sel)
> >    tree mask_type;
> >
> >    poly_uint64 nunits = sel.length ();
> > -  gcc_assert (known_eq (nunits, TYPE_VECTOR_SUBPARTS (vectype)));
> > +  gcc_assert (known_eq (nunits, TYPE_VECTOR_SUBPARTS (vectype))
> > +         || known_eq (nunits, TYPE_VECTOR_SUBPARTS (vectype) * 2));
> >
> >    mask_type = build_vector_type (ssizetype, nunits);
> >    return vec_perm_indices_to_tree (mask_type, sel);
> > @@ -14397,8 +14517,20 @@ supportable_widening_operation (vec_info *vinfo,
> >        break;
> >
> >      CASE_CONVERT:
> > -      c1 = VEC_UNPACK_LO_EXPR;
> > -      c2 = VEC_UNPACK_HI_EXPR;
> > +      {
> > +   tree cvt_type = unsigned_type_for (vectype_out);
> > +   if (TYPE_UNSIGNED (vectype_in)
> > +     && VECTOR_TYPE_P (cvt_type)
> > +     && TYPE_VECTOR_SUBPARTS (cvt_type).is_constant ()
> > +     && targetm.vectorize.use_permute_for_promotion (vectype_in, cvt_type))
> > +     {
> > +       *code1 = VEC_PERM_EXPR;
> > +       *code2 = VIEW_CONVERT_EXPR;
> > +       return true;
> > +     }
> > +   c1 = VEC_UNPACK_LO_EXPR;
> > +   c2 = VEC_UNPACK_HI_EXPR;
> > +      }
> >        break;
> >
> >      case FLOAT_EXPR:
