Re: [PATCH] tree-optimization/94864 - vector insert of vector extract simplification

Richard Sandiford via Gcc-patches Wed, 12 Jul 2023 07:15:45 -0700

Richard Biener <rguent...@suse.de> writes:
> The PRs ask for optimizing of
>
>   _1 = BIT_FIELD_REF <b_3(D), 64, 64>;
>   result_4 = BIT_INSERT_EXPR <a_2(D), _1, 64>;
>
> to a vector permutation.  The following implements this as
> match.pd pattern, improving code generation on x86_64.
>
> On the RTL level we face the issue that backend patterns inconsistently
> use vec_merge and vec_select of vec_concat to represent permutes.


Yeah, the current RTL codes probably overlap a bit too much.

Maybe we should have a rule that a vec_merge with a constant
third operand should be canonicalised to a vec_select?  And maybe
change the first operand of vec_select to be an rtvec, so that
no separate vec_concat (and thus wider mode) is needed for two-input
permutes?  Would be a lot of work though...

> I think using a (supported) permute is almost always better
> than an extract plus insert, maybe excluding the case we extract
> element zero and that's aliased to a register that can be used
> directly for insertion (not sure how to query that).

Yeah, extraction of the low element (0 for LE, N-1 for BE) is special
in RTL, in that it is now folded to a subreg.  But IMO it's reasonable
for even that case to through TARGET_VECTORIZE_VEC_PERM_CONST,
maybe with a target-independent helper function to match permute
vectors that are equivalent to extract-and-insert.

On AArch64, extract-and-insert is a single operation for other
elements too, e.g.:

        ins     v0.s[2], v1.s[1]

is a thing.  But if the helper returns the index of the extracted
elements, targets can decide for themselves whether the index is
supported or not.

Agree that this is the right thing for gimple to do FWIW.

Thanks,
Richard

> But this regresses for example gcc.target/i386/pr54855-8.c because PRE
> now realizes that
>
>   _1 = BIT_FIELD_REF <x_3(D), 64, 0>;
>   if (_1 > a_4(D))
>     goto <bb 3>; [50.00%]
>   else
>     goto <bb 4>; [50.00%]
>
>   <bb 3> [local count: 536870913]:
>
>   <bb 4> [local count: 1073741824]:
>   # iftmp.0_2 = PHI <_1(3), a_4(D)(2)>
>   x_5 = BIT_INSERT_EXPR <x_3(D), iftmp.0_2, 0>;
>
> is equal to
>
>   <bb 2> [local count: 1073741824]:
>   _1 = BIT_FIELD_REF <x_3(D), 64, 0>;
>   if (_1 > a_4(D))
>     goto <bb 4>; [50.00%]
>   else
>     goto <bb 3>; [50.00%]
>
>   <bb 3> [local count: 536870912]:
>   _7 = BIT_INSERT_EXPR <x_3(D), a_4(D), 0>;
>
>   <bb 4> [local count: 1073741824]:
>   # prephitmp_8 = PHI <x_3(D)(2), _7(3)>
>
> and that no longer produces the desired maxsd operation at the RTL
> level (we fail to match .FMAX at the GIMPLE level earlier).
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu with regressions:
>
> FAIL: gcc.target/i386/pr54855-13.c scan-assembler-times vmaxsh[ \\\\t] 1
> FAIL: gcc.target/i386/pr54855-13.c scan-assembler-not vcomish[ \\\\t]
> FAIL: gcc.target/i386/pr54855-8.c scan-assembler-times maxsd 1
> FAIL: gcc.target/i386/pr54855-8.c scan-assembler-not movsd
> FAIL: gcc.target/i386/pr54855-9.c scan-assembler-times minss 1
> FAIL: gcc.target/i386/pr54855-9.c scan-assembler-not movss
>
> I think this is also PR88540 (the lack of min/max detection, not
> sure if the SSE min/max are suitable here)
>
>       PR tree-optimization/94864
>       PR tree-optimization/94865
>       * match.pd (bit_insert @0 (BIT_FIELD_REF @1 ..) ..): New pattern
>       for vector insertion from vector extraction.
>
>       * gcc.target/i386/pr94864.c: New testcase.
>       * gcc.target/i386/pr94865.c: Likewise.
> ---
>  gcc/match.pd                            | 25 +++++++++++++++++++++++++
>  gcc/testsuite/gcc.target/i386/pr94864.c | 13 +++++++++++++
>  gcc/testsuite/gcc.target/i386/pr94865.c | 13 +++++++++++++
>  3 files changed, 51 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr94864.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr94865.c
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 8543f777a28..8cc106049c4 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -7770,6 +7770,31 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>                     wi::to_wide (@ipos) + isize))
>      (BIT_FIELD_REF @0 @rsize @rpos)))))
>  
> +/* Simplify vector inserts of other vector extracts to a permute.  */
> +(simplify
> + (bit_insert @0 (BIT_FIELD_REF@2 @1 @rsize @rpos) @ipos)
> + (if (VECTOR_TYPE_P (type)
> +      && types_match (@0, @1)
> +      && types_match (TREE_TYPE (TREE_TYPE (@0)), TREE_TYPE (@2))
> +      && TYPE_VECTOR_SUBPARTS (type).is_constant ())
> +  (with
> +   {
> +     unsigned HOST_WIDE_INT elsz
> +       = tree_to_uhwi (TYPE_SIZE (TREE_TYPE (TREE_TYPE (@1))));
> +     poly_uint64 relt = exact_div (tree_to_poly_uint64 (@rpos), elsz);
> +     poly_uint64 ielt = exact_div (tree_to_poly_uint64 (@ipos), elsz);
> +     unsigned nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> +     vec_perm_builder builder;
> +     builder.new_vector (nunits, nunits, 1);
> +     for (unsigned i = 0; i < nunits; ++i)
> +       builder.quick_push (known_eq (ielt, i) ? nunits + relt : i);
> +     vec_perm_indices sel (builder, 2, nunits);
> +   }
> +   (if (!VECTOR_MODE_P (TYPE_MODE (type))
> +     || can_vec_perm_const_p (TYPE_MODE (type), TYPE_MODE (type), sel, 
> false))
> +    (vec_perm @0 @1 { vec_perm_indices_to_tree
> +                        (build_vector_type (ssizetype, nunits), sel); })))))
> +
>  (if (canonicalize_math_after_vectorization_p ())
>   (for fmas (FMA)
>    (simplify
> diff --git a/gcc/testsuite/gcc.target/i386/pr94864.c 
> b/gcc/testsuite/gcc.target/i386/pr94864.c
> new file mode 100644
> index 00000000000..69cb481fcfe
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr94864.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -msse2 -mno-avx" } */
> +
> +typedef double v2df __attribute__((vector_size(16)));
> +
> +v2df move_sd(v2df a, v2df b)
> +{
> +    v2df result = a;
> +    result[0] = b[1];
> +    return result;
> +}
> +
> +/* { dg-final { scan-assembler "unpckhpd\[\\t \]%xmm0, %xmm1" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr94865.c 
> b/gcc/testsuite/gcc.target/i386/pr94865.c
> new file mode 100644
> index 00000000000..84065ac2467
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr94865.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -msse2 -mno-avx" } */
> +
> +typedef double v2df __attribute__((vector_size(16)));
> +
> +v2df move_sd(v2df a, v2df b)
> +{
> +    v2df result = a;
> +    result[1] = b[1];
> +    return result;
> +}
> +
> +/* { dg-final { scan-assembler "shufpd\[\\t \]*.2, %xmm1, %xmm0" } } */

Re: [PATCH] tree-optimization/94864 - vector insert of vector extract simplification

Reply via email to