> -----Original Message-----
> From: Richard Sandiford <[email protected]>
> Sent: Monday, October 14, 2024 7:34 PM
> To: Tamar Christina <[email protected]>
> Cc: [email protected]; nd <[email protected]>; [email protected]
> Subject: Re: [PATCH 1/4]middle-end: support multi-step zero-extends using
> VEC_PERM_EXPR
>
> Tamar Christina <[email protected]> writes:
> > Hi All,
> >
> > This patch series adds support for a target to do a direct conversion for
> > zero extends using permutes.
> >
> > To do this it uses a target hook use_permute_for_promotion which must be
> > implemented by targets. This hook is used to indicate:
> >
> > 1. can a target do this for the given modes.
> > 2. is it profitable for the target to do it.
> > 3. can the target convert between various vector modes with a VIEW_CONVERT.
> >
> > Using permutations has a big benefit over multi-step zero extensions because
> > they both reduce the number of needed instructions and increase throughput,
> > as the dependency chain is removed.
> >
> > Concretely on AArch64 this changes:
> >
> > void test4(unsigned char *x, long long *y, int n) {
> >     for(int i = 0; i < n; i++) {
> >         y[i] = x[i];
> >     }
> > }
> >
> > from generating:
> >
> > .L4:
> >         ldr     q30, [x4], 16
> >         add     x3, x3, 128
> >         zip1    v1.16b, v30.16b, v31.16b
> >         zip2    v30.16b, v30.16b, v31.16b
> >         zip1    v2.8h, v1.8h, v31.8h
> >         zip1    v0.8h, v30.8h, v31.8h
> >         zip2    v1.8h, v1.8h, v31.8h
> >         zip2    v30.8h, v30.8h, v31.8h
> >         zip1    v26.4s, v2.4s, v31.4s
> >         zip1    v29.4s, v0.4s, v31.4s
> >         zip1    v28.4s, v1.4s, v31.4s
> >         zip1    v27.4s, v30.4s, v31.4s
> >         zip2    v2.4s, v2.4s, v31.4s
> >         zip2    v0.4s, v0.4s, v31.4s
> >         zip2    v1.4s, v1.4s, v31.4s
> >         zip2    v30.4s, v30.4s, v31.4s
> >         stp     q26, q2, [x3, -128]
> >         stp     q28, q1, [x3, -96]
> >         stp     q29, q0, [x3, -64]
> >         stp     q27, q30, [x3, -32]
> >         cmp     x4, x5
> >         bne     .L4
> >
> > and instead we get:
> >
> > .L4:
> >         add     x3, x3, 128
> >         ldr     q23, [x4], 16
> >         tbl     v5.16b, {v23.16b}, v31.16b
> >         tbl     v4.16b, {v23.16b}, v30.16b
> >         tbl     v3.16b, {v23.16b}, v29.16b
> >         tbl     v2.16b, {v23.16b}, v28.16b
> >         tbl     v1.16b, {v23.16b}, v27.16b
> >         tbl     v0.16b, {v23.16b}, v26.16b
> >         tbl     v22.16b, {v23.16b}, v25.16b
> >         tbl     v23.16b, {v23.16b}, v24.16b
> >         stp     q5, q4, [x3, -128]
> >         stp     q3, q2, [x3, -96]
> >         stp     q1, q0, [x3, -64]
> >         stp     q22, q23, [x3, -32]
> >         cmp     x4, x5
> >         bne     .L4
> >
> > Tests are added in the AArch64 patch introducing the hook.  The testsuite
> > also already has about 800 runtime tests that are affected by this.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu, arm-none-linux-gnueabihf,
> > x86_64-pc-linux-gnu -m32, -m64 and no issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > * target.def (use_permute_for_promotion): New.
> > * doc/tm.texi.in: Document it.
> > * doc/tm.texi: Regenerate.
> > * targhooks.cc (default_use_permute_for_promotion): New.
> > * targhooks.h (default_use_permute_for_promotion): New.
> > * tree-vect-stmts.cc (vectorizable_conversion): Support direct conversion
> > with permute.
> > (vect_create_vectorized_promotion_stmts): Likewise.
> > (supportable_widening_operation): Likewise.
> > (vect_gen_perm_mask_any): Allow vector permutes where input registers
> > are half the width of the result per the GCC 14 relaxation of
> > VEC_PERM_EXPR.
> >
> > ---
> >
> > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> > index
> 4deb3d2c283a2964972b94f434370a6f57ea816a..e8192590ac14005bf7cb5f73
> 1c16ee7eacb78143 100644
> > --- a/gcc/doc/tm.texi
> > +++ b/gcc/doc/tm.texi
> > @@ -6480,6 +6480,15 @@ type @code{internal_fn}) should be considered
> expensive when the mask is
> > all zeros. GCC can then try to branch around the instruction instead.
> > @end deftypefn
> >
> > +@deftypefn {Target Hook} bool
> TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION (const_tree
> @var{in_type}, const_tree @var{out_type})
> > +This hook returns true if the operation promoting @var{in_type} to
> > +@var{out_type} should be done as a vector permute. If @var{out_type} is
> > +a signed type the operation will be done as the related unsigned type and
> > +converted to @var{out_type}. If the target supports the needed permute,
> > +is able to convert unsigned(@var{out_type}) to @var{out_type} and it is
> > +beneficial to do so, the hook should return true, else false should be returned.
> > +@end deftypefn
>
> Just a review of the documentation, but: is a two-step process really
> necessary for signed out_types? I thought it could be done directly,
> since it's in_type rather than out_type that determines the type of
> extension.
Thanks!
I think this is an indication that the text is ambiguous.  The intention was
to say that if out_type is signed, we still keep the type as signed, but
insert an intermediate cast to (unsigned type (out_type)).
The optimization only looks at in_type, as you correctly point out.
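For the unsigned char -> long long case from the cover letter, the intended
sequence is roughly the following (a hand-written little-endian sketch of the
first of the eight output vectors, not actual dump output; the names and lane
numbers are only illustrative):

  /* x is the 16-byte input vector and zeros an all-zero vector of the
     same type.  Indices >= 16 select from the second operand, so every
     lane not taken from x is filled with zero.  */
  vector(16) unsigned char p
    = VEC_PERM_EXPR <x, zeros, { 0, 16, 16, 16, 16, 16, 16, 16,
                                 1, 16, 16, 16, 16, 16, 16, 16 }>;
  /* Reinterpret the zero-extended bytes as the unsigned variant of
     out_type...  */
  vector(2) long long unsigned int u
    = VIEW_CONVERT_EXPR <vector(2) long long unsigned int> (p);
  /* ...and only then convert to the signed out_type.  */
  vector(2) long long int r = (vector(2) long long int) u;

So the permute and the reinterpretation always happen on the unsigned type,
and the signedness of out_type only affects that final cast.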
I think you're right that the documentation explains too much of how the
optimization does the transform, rather than just the transform itself.
Would it be clearer if I just deleted the
> If @var{out_type} is
> > +a signed type the operation will be done as the related unsigned type and
> > +converted to @var{out_type}.
part?
Thanks for raising this :)
Thanks,
Tamar
>
> Thanks,
> Richard
>
> > +
> > @deftypefn {Target Hook} {class vector_costs *}
> TARGET_VECTORIZE_CREATE_COSTS (vec_info *@var{vinfo}, bool
> @var{costing_for_scalar})
> > This hook should initialize target-specific data structures in preparation
> > for modeling the costs of vectorizing a loop or basic block. The default
> > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> > index
> 9f147ccb95cc6d4e79cdf5b265666ad502492145..c007bc707372dd374e8effc52
> d29b76f5bc283a1 100644
> > --- a/gcc/doc/tm.texi.in
> > +++ b/gcc/doc/tm.texi.in
> > @@ -4303,6 +4303,8 @@ address; but often a machine-dependent strategy
> can generate better code.
> >
> > @hook TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE
> >
> > +@hook TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION
> > +
> > @hook TARGET_VECTORIZE_CREATE_COSTS
> >
> > @hook TARGET_VECTORIZE_BUILTIN_GATHER
> > diff --git a/gcc/target.def b/gcc/target.def
> > index
> b31550108883c5c3f5ffc7e46a1e8a7b839ebe83..58545d5ef4248da5850edec8f
> 4db9f2636973598 100644
> > --- a/gcc/target.def
> > +++ b/gcc/target.def
> > @@ -2056,6 +2056,20 @@ all zeros. GCC can then try to branch around the
> instruction instead.",
> > (unsigned ifn),
> > default_empty_mask_is_expensive)
> >
> > +/* Function to say whether a target supports and prefers to use permutes
> > for
> > + zero extensions or truncates. */
> > +DEFHOOK
> > +(use_permute_for_promotion,
> > + "This hook returns true if the operation promoting @var{in_type} to\n\
> > +@var{out_type} should be done as a vector permute. If @var{out_type} is\n\
> > +a signed type the operation will be done as the related unsigned type
> > and\n\
> > +converted to @var{out_type}. If the target supports the needed permute,\n\
> > +is able to convert unsigned(@var{out_type}) to @var{out_type} and it is\n\
> > +beneficial to do so, the hook should return true, else false should be returned.",
> > + bool,
> > + (const_tree in_type, const_tree out_type),
> > + default_use_permute_for_promotion)
> > +
> > /* Target builtin that implements vector gather operation. */
> > DEFHOOK
> > (builtin_gather,
> > diff --git a/gcc/targhooks.h b/gcc/targhooks.h
> > index
> 2704d6008f14d2aa65671f002af886d3b802effa..723f8f4fda7808b6899f10f8b
> 3fafad74d3c536f 100644
> > --- a/gcc/targhooks.h
> > +++ b/gcc/targhooks.h
> > @@ -124,6 +124,7 @@ extern opt_machine_mode
> default_vectorize_related_mode (machine_mode,
> > extern opt_machine_mode default_get_mask_mode (machine_mode);
> > extern bool default_empty_mask_is_expensive (unsigned);
> > extern bool default_conditional_operation_is_expensive (unsigned);
> > +extern bool default_use_permute_for_promotion (const_tree, const_tree);
> > extern vector_costs *default_vectorize_create_costs (vec_info *, bool);
> >
> > /* OpenACC hooks. */
> > diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
> > index
> dc040df9fcd1182b62d83088ee7fb3a248c99f51..a487eab794fe9f1089ecb58fdf
> c881fdb19d28f3 100644
> > --- a/gcc/targhooks.cc
> > +++ b/gcc/targhooks.cc
> > @@ -1615,6 +1615,14 @@ default_conditional_operation_is_expensive
> (unsigned ifn)
> > return ifn == IFN_MASK_STORE;
> > }
> >
> > +/* By default no targets prefer permutes over multi step extension. */
> > +
> > +bool
> > +default_use_permute_for_promotion (const_tree, const_tree)
> > +{
> > + return false;
> > +}
> > +
> > /* By default consider masked stores to be expensive. */
> >
> > bool
> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > index
> 4f6905f15417f90c6f36e1711a7a25071f0f507c..f2939655e4ec34111baa8894e
> af769d29b1c5b82 100644
> > --- a/gcc/tree-vect-stmts.cc
> > +++ b/gcc/tree-vect-stmts.cc
> > @@ -5129,6 +5129,111 @@ vect_create_vectorized_promotion_stmts
> (vec_info *vinfo,
> > gimple *new_stmt1, *new_stmt2;
> > vec<tree> vec_tmp = vNULL;
> >
> > + /* If we're using a VEC_PERM_EXPR then we're widening to the final type
> > in
> > + one go. */
> > + if (ch1 == VEC_PERM_EXPR
> > + && op_type == unary_op)
> > + {
> > + vec_tmp.create (vec_oprnds0->length () * 2);
> > + bool failed_p = false;
> > +
> > + /* Extending with a vec-perm requires 2 instructions per step. */
> > + FOR_EACH_VEC_ELT (*vec_oprnds0, i, vop0)
> > + {
> > + tree vectype_in = TREE_TYPE (vop0);
> > + tree vectype_out = TREE_TYPE (vec_dest);
> > + machine_mode mode_in = TYPE_MODE (vectype_in);
> > + machine_mode mode_out = TYPE_MODE (vectype_out);
> > + unsigned bitsize_in = element_precision (vectype_in);
> > + unsigned tot_in, tot_out;
> > + unsigned HOST_WIDE_INT count;
> > +
> > + /* We can't really support VLA here as the indexes depend on the VL.
> > + VLA should really use widening instructions like widening
> > + loads. */
> > + if (!GET_MODE_BITSIZE (mode_in).is_constant (&tot_in)
> > + || !GET_MODE_BITSIZE (mode_out).is_constant (&tot_out)
> > + || !TYPE_VECTOR_SUBPARTS (vectype_in).is_constant (&count)
> > + || !TYPE_UNSIGNED (vectype_in)
> > + || !targetm.vectorize.use_permute_for_promotion (vectype_in,
> > + vectype_out))
> > + {
> > + failed_p = true;
> > + break;
> > + }
> > +
> > + unsigned steps = tot_out / bitsize_in;
> > + tree zero = build_zero_cst (vectype_in);
> > +
> > + unsigned chunk_size
> > + = exact_div (TYPE_VECTOR_SUBPARTS (vectype_in),
> > + TYPE_VECTOR_SUBPARTS (vectype_out)).to_constant ();
> > + unsigned step_size = chunk_size * (tot_out / tot_in);
> > + unsigned nunits = tot_out / bitsize_in;
> > +
> > + vec_perm_builder sel (steps, 1, 1);
> > + sel.quick_grow (steps);
> > +
> > + /* Flood fill with the out of range value first. */
> > + for (unsigned long i = 0; i < steps; ++i)
> > + sel[i] = count;
> > +
> > + tree var;
> > + tree elem_in = TREE_TYPE (vectype_in);
> > + machine_mode elem_mode_in = TYPE_MODE (elem_in);
> > + unsigned long idx = 0;
> > + tree vc_in = get_related_vectype_for_scalar_type (elem_mode_in,
> > + elem_in, nunits);
> > +
> > + for (unsigned long j = 0; j < chunk_size; j++)
> > + {
> > + if (WORDS_BIG_ENDIAN)
> > + for (int i = steps - 1; i >= 0; i -= step_size, idx++)
> > + sel[i] = idx;
> > + else
> > + for (int i = 0; i < (int)steps; i += step_size, idx++)
> > + sel[i] = idx;
> > +
> > + vec_perm_indices indices (sel, 2, steps);
> > +
> > + tree perm_mask = vect_gen_perm_mask_checked (vc_in, indices);
> > + auto vec_oprnd = make_ssa_name (vc_in);
> > + auto new_stmt = gimple_build_assign (vec_oprnd, VEC_PERM_EXPR,
> > + vop0, zero, perm_mask);
> > + vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
> > +
> > + tree intvect_out = unsigned_type_for (vectype_out);
> > + var = make_ssa_name (intvect_out);
> > + new_stmt = gimple_build_assign (var, build1 (VIEW_CONVERT_EXPR,
> > + intvect_out,
> > + vec_oprnd));
> > + vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
> > +
> > + gcc_assert (ch2.is_tree_code ());
> > +
> > + var = make_ssa_name (vectype_out);
> > + if (ch2 == VIEW_CONVERT_EXPR)
> > + new_stmt = gimple_build_assign (var,
> > + build1 (VIEW_CONVERT_EXPR,
> > + vectype_out,
> > + vec_oprnd));
> > + else
> > + new_stmt = gimple_build_assign (var, (tree_code)ch2,
> > + vec_oprnd);
> > +
> > + vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
> > + vec_tmp.safe_push (var);
> > + }
> > + }
> > +
> > + if (!failed_p)
> > + {
> > + vec_oprnds0->release ();
> > + *vec_oprnds0 = vec_tmp;
> > + return;
> > + }
> > + }
> > +
> > vec_tmp.create (vec_oprnds0->length () * 2);
> > FOR_EACH_VEC_ELT (*vec_oprnds0, i, vop0)
> > {
> > @@ -5495,6 +5600,20 @@ vectorizable_conversion (vec_info *vinfo,
> > || GET_MODE_SIZE (lhs_mode) <= GET_MODE_SIZE (rhs_mode))
> > goto unsupported;
> >
> > + /* Check to see if the target can use a permute to perform the zero
> > + extension. */
> > + intermediate_type = unsigned_type_for (vectype_out);
> > + if (TYPE_UNSIGNED (vectype_in)
> > + && VECTOR_TYPE_P (intermediate_type)
> > + && TYPE_VECTOR_SUBPARTS (intermediate_type).is_constant ()
> > + && targetm.vectorize.use_permute_for_promotion (vectype_in,
> > + intermediate_type))
> > + {
> > + code1 = VEC_PERM_EXPR;
> > + code2 = FLOAT_EXPR;
> > + break;
> > + }
> > +
> > fltsz = GET_MODE_SIZE (lhs_mode);
> > FOR_EACH_2XWIDER_MODE (rhs_mode_iter, rhs_mode)
> > {
> > @@ -9804,7 +9923,8 @@ vect_gen_perm_mask_any (tree vectype, const
> vec_perm_indices &sel)
> > tree mask_type;
> >
> > poly_uint64 nunits = sel.length ();
> > - gcc_assert (known_eq (nunits, TYPE_VECTOR_SUBPARTS (vectype)));
> > + gcc_assert (known_eq (nunits, TYPE_VECTOR_SUBPARTS (vectype))
> > + || known_eq (nunits, TYPE_VECTOR_SUBPARTS (vectype) * 2));
> >
> > mask_type = build_vector_type (ssizetype, nunits);
> > return vec_perm_indices_to_tree (mask_type, sel);
> > @@ -14397,8 +14517,20 @@ supportable_widening_operation (vec_info
> *vinfo,
> > break;
> >
> > CASE_CONVERT:
> > - c1 = VEC_UNPACK_LO_EXPR;
> > - c2 = VEC_UNPACK_HI_EXPR;
> > + {
> > + tree cvt_type = unsigned_type_for (vectype_out);
> > + if (TYPE_UNSIGNED (vectype_in)
> > + && VECTOR_TYPE_P (cvt_type)
> > + && TYPE_VECTOR_SUBPARTS (cvt_type).is_constant ()
> > + && targetm.vectorize.use_permute_for_promotion (vectype_in,
> cvt_type))
> > + {
> > + *code1 = VEC_PERM_EXPR;
> > + *code2 = VIEW_CONVERT_EXPR;
> > + return true;
> > + }
> > + c1 = VEC_UNPACK_LO_EXPR;
> > + c2 = VEC_UNPACK_HI_EXPR;
> > + }
> > break;
> >
> > case FLOAT_EXPR: