> -----Original Message----- > From: Richard Sandiford <richard.sandif...@arm.com> > Sent: Monday, October 14, 2024 7:34 PM > To: Tamar Christina <tamar.christ...@arm.com> > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; rguent...@suse.de > Subject: Re: [PATCH 1/4]middle-end: support multi-step zero-extends using > VEC_PERM_EXPR > > Tamar Christina <tamar.christ...@arm.com> writes: > > Hi All, > > > > This patch series adds support for a target to do a direct convertion for > > zero > > extends using permutes. > > > > To do this it uses a target hook use_permute_for_promotio which must be > > implemented by targets. This hook is used to indicate: > > > > 1. can a target do this for the given modes. > > 2. is it profitable for the target to do it. > > 3. can the target convert between various vector modes with a VIEW_CONVERT. > > > > Using permutations have a big benefit of multi-step zero extensions because > they > > both reduce the number of needed instructions, but also increase throughput > > as > > the dependency chain is removed. > > > > Concretely on AArch64 this changes: > > > > void test4(unsigned char *x, long long *y, int n) { > > for(int i = 0; i < n; i++) { > > y[i] = x[i]; > > } > > } > > > > from generating: > > > > .L4: > > ldr q30, [x4], 16 > > add x3, x3, 128 > > zip1 v1.16b, v30.16b, v31.16b > > zip2 v30.16b, v30.16b, v31.16b > > zip1 v2.8h, v1.8h, v31.8h > > zip1 v0.8h, v30.8h, v31.8h > > zip2 v1.8h, v1.8h, v31.8h > > zip2 v30.8h, v30.8h, v31.8h > > zip1 v26.4s, v2.4s, v31.4s > > zip1 v29.4s, v0.4s, v31.4s > > zip1 v28.4s, v1.4s, v31.4s > > zip1 v27.4s, v30.4s, v31.4s > > zip2 v2.4s, v2.4s, v31.4s > > zip2 v0.4s, v0.4s, v31.4s > > zip2 v1.4s, v1.4s, v31.4s > > zip2 v30.4s, v30.4s, v31.4s > > stp q26, q2, [x3, -128] > > stp q28, q1, [x3, -96] > > stp q29, q0, [x3, -64] > > stp q27, q30, [x3, -32] > > cmp x4, x5 > > bne .L4 > > > > and instead we get: > > > > .L4: > > add x3, x3, 128 > > ldr q23, [x4], 16 > > tbl v5.16b, {v23.16b}, v31.16b > > tbl v4.16b, {v23.16b}, v30.16b > > tbl v3.16b, {v23.16b}, v29.16b > > tbl v2.16b, {v23.16b}, v28.16b > > tbl v1.16b, {v23.16b}, v27.16b > > tbl v0.16b, {v23.16b}, v26.16b > > tbl v22.16b, {v23.16b}, v25.16b > > tbl v23.16b, {v23.16b}, v24.16b > > stp q5, q4, [x3, -128] > > stp q3, q2, [x3, -96] > > stp q1, q0, [x3, -64] > > stp q22, q23, [x3, -32] > > cmp x4, x5 > > bne .L4 > > > > Tests are added in the AArch64 patch introducing the hook. The testsuite > > also > > already had about 800 runtime tests that get affected by this. > > > > Bootstrapped Regtested on aarch64-none-linux-gnu, arm-none-linux-gnueabihf, > > x86_64-pc-linux-gnu -m32, -m64 and no issues. > > > > Ok for master? > > > > Thanks, > > Tamar > > > > gcc/ChangeLog: > > > > * target.def (use_permute_for_promotion): New. > > * doc/tm.texi.in: Document it. > > * doc/tm.texi: Regenerate. > > * targhooks.cc (default_use_permute_for_promotion): New. > > * targhooks.h (default_use_permute_for_promotion): New. > > (vectorizable_conversion): Support direct convertion with permute. > > * tree-vect-stmts.cc (vect_create_vectorized_promotion_stmts): Likewise. > > (supportable_widening_operation): Likewise. > > (vect_gen_perm_mask_any): Allow vector permutes where input registers > > are half the width of the result per the GCC 14 relaxation of > > VEC_PERM_EXPR. > > > > --- > > > > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi > > index > 4deb3d2c283a2964972b94f434370a6f57ea816a..e8192590ac14005bf7cb5f73 > 1c16ee7eacb78143 100644 > > --- a/gcc/doc/tm.texi > > +++ b/gcc/doc/tm.texi > > @@ -6480,6 +6480,15 @@ type @code{internal_fn}) should be considered > expensive when the mask is > > all zeros. GCC can then try to branch around the instruction instead. > > @end deftypefn > > > > +@deftypefn {Target Hook} bool > TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION (const_tree > @var{in_type}, const_tree @var{out_type}) > > +This hook returns true if the operation promoting @var{in_type} to > > +@var{out_type} should be done as a vector permute. If @var{out_type} is > > +a signed type the operation will be done as the related unsigned type and > > +converted to @var{out_type}. If the target supports the needed permute, > > +is able to convert unsigned(@var{out_type}) to @var{out_type} and it is > > +beneficial to the hook should return true, else false should be returned. > > +@end deftypefn > > Just a review of the documentation, but: is a two-step process really > necessary for signed out_types? I thought it could be done directly, > since it's in_type rather than out_type that determines the type of > extension.
Thanks!, I think this is an indication the text is ambiguous. The intention was to say that if out_type is signed, we still keep the type as signed, but insert an intermediate cast to (unsigned type(out_type)). The optimization only looks at in_type as you correctly point out. I think you're right in that the documentation is explaining too much of how the optimization does the transform, rather than explaining just the transform. Would it be clearer is I just delete the > If @var{out_type} is > > +a signed type the operation will be done as the related unsigned type and > > +converted to @var{out_type}. Part? Thanks for raising this :) Thanks, Tamar > > Thanks, > Richard > > > + > > @deftypefn {Target Hook} {class vector_costs *} > TARGET_VECTORIZE_CREATE_COSTS (vec_info *@var{vinfo}, bool > @var{costing_for_scalar}) > > This hook should initialize target-specific data structures in preparation > > for modeling the costs of vectorizing a loop or basic block. The default > > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in > > index > 9f147ccb95cc6d4e79cdf5b265666ad502492145..c007bc707372dd374e8effc52 > d29b76f5bc283a1 100644 > > --- a/gcc/doc/tm.texi.in > > +++ b/gcc/doc/tm.texi.in > > @@ -4303,6 +4303,8 @@ address; but often a machine-dependent strategy > can generate better code. > > > > @hook TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE > > > > +@hook TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION > > + > > @hook TARGET_VECTORIZE_CREATE_COSTS > > > > @hook TARGET_VECTORIZE_BUILTIN_GATHER > > diff --git a/gcc/target.def b/gcc/target.def > > index > b31550108883c5c3f5ffc7e46a1e8a7b839ebe83..58545d5ef4248da5850edec8f > 4db9f2636973598 100644 > > --- a/gcc/target.def > > +++ b/gcc/target.def > > @@ -2056,6 +2056,20 @@ all zeros. GCC can then try to branch around the > instruction instead.", > > (unsigned ifn), > > default_empty_mask_is_expensive) > > > > +/* Function to say whether a target supports and prefers to use permutes > > for > > + zero extensions or truncates. */ > > +DEFHOOK > > +(use_permute_for_promotion, > > + "This hook returns true if the operation promoting @var{in_type} to\n\ > > +@var{out_type} should be done as a vector permute. If @var{out_type} is\n\ > > +a signed type the operation will be done as the related unsigned type > > and\n\ > > +converted to @var{out_type}. If the target supports the needed permute,\n\ > > +is able to convert unsigned(@var{out_type}) to @var{out_type} and it is\n\ > > +beneficial to the hook should return true, else false should be returned.", > > + bool, > > + (const_tree in_type, const_tree out_type), > > + default_use_permute_for_promotion) > > + > > /* Target builtin that implements vector gather operation. */ > > DEFHOOK > > (builtin_gather, > > diff --git a/gcc/targhooks.h b/gcc/targhooks.h > > index > 2704d6008f14d2aa65671f002af886d3b802effa..723f8f4fda7808b6899f10f8b > 3fafad74d3c536f 100644 > > --- a/gcc/targhooks.h > > +++ b/gcc/targhooks.h > > @@ -124,6 +124,7 @@ extern opt_machine_mode > default_vectorize_related_mode (machine_mode, > > extern opt_machine_mode default_get_mask_mode (machine_mode); > > extern bool default_empty_mask_is_expensive (unsigned); > > extern bool default_conditional_operation_is_expensive (unsigned); > > +extern bool default_use_permute_for_promotion (const_tree, const_tree); > > extern vector_costs *default_vectorize_create_costs (vec_info *, bool); > > > > /* OpenACC hooks. */ > > diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc > > index > dc040df9fcd1182b62d83088ee7fb3a248c99f51..a487eab794fe9f1089ecb58fdf > c881fdb19d28f3 100644 > > --- a/gcc/targhooks.cc > > +++ b/gcc/targhooks.cc > > @@ -1615,6 +1615,14 @@ default_conditional_operation_is_expensive > (unsigned ifn) > > return ifn == IFN_MASK_STORE; > > } > > > > +/* By default no targets prefer permutes over multi step extension. */ > > + > > +bool > > +default_use_permute_for_promotion (const_tree, const_tree) > > +{ > > + return false; > > +} > > + > > /* By default consider masked stores to be expensive. */ > > > > bool > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc > > index > 4f6905f15417f90c6f36e1711a7a25071f0f507c..f2939655e4ec34111baa8894e > af769d29b1c5b82 100644 > > --- a/gcc/tree-vect-stmts.cc > > +++ b/gcc/tree-vect-stmts.cc > > @@ -5129,6 +5129,111 @@ vect_create_vectorized_promotion_stmts > (vec_info *vinfo, > > gimple *new_stmt1, *new_stmt2; > > vec<tree> vec_tmp = vNULL; > > > > + /* If we're using a VEC_PERM_EXPR then we're widening to the final type > > in > > + one go. */ > > + if (ch1 == VEC_PERM_EXPR > > + && op_type == unary_op) > > + { > > + vec_tmp.create (vec_oprnds0->length () * 2); > > + bool failed_p = false; > > + > > + /* Extending with a vec-perm requires 2 instructions per step. */ > > + FOR_EACH_VEC_ELT (*vec_oprnds0, i, vop0) > > + { > > + tree vectype_in = TREE_TYPE (vop0); > > + tree vectype_out = TREE_TYPE (vec_dest); > > + machine_mode mode_in = TYPE_MODE (vectype_in); > > + machine_mode mode_out = TYPE_MODE (vectype_out); > > + unsigned bitsize_in = element_precision (vectype_in); > > + unsigned tot_in, tot_out; > > + unsigned HOST_WIDE_INT count; > > + > > + /* We can't really support VLA here as the indexes depend on the VL. > > + VLA should really use widening instructions like widening > > + loads. */ > > + if (!GET_MODE_BITSIZE (mode_in).is_constant (&tot_in) > > + || !GET_MODE_BITSIZE (mode_out).is_constant (&tot_out) > > + || !TYPE_VECTOR_SUBPARTS (vectype_in).is_constant (&count) > > + || !TYPE_UNSIGNED (vectype_in) > > + || !targetm.vectorize.use_permute_for_promotion (vectype_in, > > + vectype_out)) > > + { > > + failed_p = true; > > + break; > > + } > > + > > + unsigned steps = tot_out / bitsize_in; > > + tree zero = build_zero_cst (vectype_in); > > + > > + unsigned chunk_size > > + = exact_div (TYPE_VECTOR_SUBPARTS (vectype_in), > > + TYPE_VECTOR_SUBPARTS (vectype_out)).to_constant (); > > + unsigned step_size = chunk_size * (tot_out / tot_in); > > + unsigned nunits = tot_out / bitsize_in; > > + > > + vec_perm_builder sel (steps, 1, 1); > > + sel.quick_grow (steps); > > + > > + /* Flood fill with the out of range value first. */ > > + for (unsigned long i = 0; i < steps; ++i) > > + sel[i] = count; > > + > > + tree var; > > + tree elem_in = TREE_TYPE (vectype_in); > > + machine_mode elem_mode_in = TYPE_MODE (elem_in); > > + unsigned long idx = 0; > > + tree vc_in = get_related_vectype_for_scalar_type (elem_mode_in, > > + elem_in, nunits); > > + > > + for (unsigned long j = 0; j < chunk_size; j++) > > + { > > + if (WORDS_BIG_ENDIAN) > > + for (int i = steps - 1; i >= 0; i -= step_size, idx++) > > + sel[i] = idx; > > + else > > + for (int i = 0; i < (int)steps; i += step_size, idx++) > > + sel[i] = idx; > > + > > + vec_perm_indices indices (sel, 2, steps); > > + > > + tree perm_mask = vect_gen_perm_mask_checked (vc_in, indices); > > + auto vec_oprnd = make_ssa_name (vc_in); > > + auto new_stmt = gimple_build_assign (vec_oprnd, VEC_PERM_EXPR, > > + vop0, zero, perm_mask); > > + vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi); > > + > > + tree intvect_out = unsigned_type_for (vectype_out); > > + var = make_ssa_name (intvect_out); > > + new_stmt = gimple_build_assign (var, build1 (VIEW_CONVERT_EXPR, > > + intvect_out, > > + vec_oprnd)); > > + vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi); > > + > > + gcc_assert (ch2.is_tree_code ()); > > + > > + var = make_ssa_name (vectype_out); > > + if (ch2 == VIEW_CONVERT_EXPR) > > + new_stmt = gimple_build_assign (var, > > + build1 (VIEW_CONVERT_EXPR, > > + vectype_out, > > + vec_oprnd)); > > + else > > + new_stmt = gimple_build_assign (var, (tree_code)ch2, > > + vec_oprnd); > > + > > + vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi); > > + vec_tmp.safe_push (var); > > + } > > + } > > + > > + if (!failed_p) > > + { > > + vec_oprnds0->release (); > > + *vec_oprnds0 = vec_tmp; > > + return; > > + } > > + } > > + > > vec_tmp.create (vec_oprnds0->length () * 2); > > FOR_EACH_VEC_ELT (*vec_oprnds0, i, vop0) > > { > > @@ -5495,6 +5600,20 @@ vectorizable_conversion (vec_info *vinfo, > > || GET_MODE_SIZE (lhs_mode) <= GET_MODE_SIZE (rhs_mode)) > > goto unsupported; > > > > + /* Check to see if the target can use a permute to perform the zero > > + extension. */ > > + intermediate_type = unsigned_type_for (vectype_out); > > + if (TYPE_UNSIGNED (vectype_in) > > + && VECTOR_TYPE_P (intermediate_type) > > + && TYPE_VECTOR_SUBPARTS (intermediate_type).is_constant () > > + && targetm.vectorize.use_permute_for_promotion (vectype_in, > > + intermediate_type)) > > + { > > + code1 = VEC_PERM_EXPR; > > + code2 = FLOAT_EXPR; > > + break; > > + } > > + > > fltsz = GET_MODE_SIZE (lhs_mode); > > FOR_EACH_2XWIDER_MODE (rhs_mode_iter, rhs_mode) > > { > > @@ -9804,7 +9923,8 @@ vect_gen_perm_mask_any (tree vectype, const > vec_perm_indices &sel) > > tree mask_type; > > > > poly_uint64 nunits = sel.length (); > > - gcc_assert (known_eq (nunits, TYPE_VECTOR_SUBPARTS (vectype))); > > + gcc_assert (known_eq (nunits, TYPE_VECTOR_SUBPARTS (vectype)) > > + || known_eq (nunits, TYPE_VECTOR_SUBPARTS (vectype) * 2)); > > > > mask_type = build_vector_type (ssizetype, nunits); > > return vec_perm_indices_to_tree (mask_type, sel); > > @@ -14397,8 +14517,20 @@ supportable_widening_operation (vec_info > *vinfo, > > break; > > > > CASE_CONVERT: > > - c1 = VEC_UNPACK_LO_EXPR; > > - c2 = VEC_UNPACK_HI_EXPR; > > + { > > + tree cvt_type = unsigned_type_for (vectype_out); > > + if (TYPE_UNSIGNED (vectype_in) > > + && VECTOR_TYPE_P (cvt_type) > > + && TYPE_VECTOR_SUBPARTS (cvt_type).is_constant () > > + && targetm.vectorize.use_permute_for_promotion (vectype_in, > cvt_type)) > > + { > > + *code1 = VEC_PERM_EXPR; > > + *code2 = VIEW_CONVERT_EXPR; > > + return true; > > + } > > + c1 = VEC_UNPACK_LO_EXPR; > > + c2 = VEC_UNPACK_HI_EXPR; > > + } > > break; > > > > case FLOAT_EXPR: