> -----Original Message-----
> From: Richard Biener <rguent...@suse.de>
> Sent: Thursday, June 17, 2021 10:45 AM
> To: gcc-patches@gcc.gnu.org
> Cc: hongtao....@intel.com; ubiz...@gmail.com; Tamar Christina
> <tamar.christ...@arm.com>
> Subject: [PATCH][RFC] Add x86 subadd SLP pattern
>
> This adds SLP pattern recognition for the SSE3/AVX [v]addsubp{ds} v0, v1
> instructions which compute { v0[0] - v1[0], v0[1] + v1[1], ... }, thus
> subtract, add alternating on lanes, starting with subtract.
>
> It adds a corresponding optab and direct internal function,
> vec_subadd$a3, and at the moment, to make the i386 backend changes
> "obvious", duplicates the existing avx_addsubv4df3 pattern with the new
> canonical name (CODE_FOR_* and gen_* are used throughout the intrinsic
> code, so the actual change to rename all existing patterns will be quite
> a bit bigger).  I expect some bike-shedding on subadd vs. addsub so I
> delay that change ;)
>
> ARC seems to have both addsub{V2HI,V2SI,V4HI} and
> subadd{V2HI,V2SI,V4HI}.  ARM iwmmxt has wsubaddx on V4HI but I didn't
> dare to decipher whether it's -, + or +, -.  bfin has both variants as
> well, plus a saturating variant of both (on v2hi).
> But none of the major vector ISAs besides x86 seem to have it.
> Still, some targets having both and opting for the naming I chose would
> make that a better fit IMHO.
In terms of internal_fn I think we'll need both; whatever the target
expands to is up to it.  But the reason for both is that when you get to
the bigger cases, the difference between ADDSUB and SUBADD determines
whether it's a conjugate or not.  (I believe x86 has those with fmaddsub
and fmsubadd.)  The rework then just matches MUL + ADDSUB or SUBADD and
never needs to look further.  This keeps stacking up to more complicated
sequences like complex dot products.

> Uros/Hongtao - would mass-renaming of the x86 addsub patterns to subadd
> be OK?  (not touching the FMA ones which have both)
>
> The SLP pattern matches the exact alternating lane sequence rather than
> trying to be clever and anticipating incoming permutes - we could
> permute the two input vectors to the needed lane alternation, do the
> subadd and then permute the result vector back, but that's only
> profitable in case the two input permutes or the output permute vanish -
> something Tamar's refactoring of SLP pattern recog should make possible
> (I hope ;)).

It will; it pushes it to optimize_slp, which will allow you to decide
what to do.  As part of this it allows additional transformations.  So
atm this is where I do ADDSUB -> COMPLEX_ADD depending on the permute it
sees.

Regards,
Tamar

> So as said, the patch is incomplete on the x86 backend (and test
> coverage) side, but the vectorizer part should be finalized by the time
> Tamar posts his work.
>
> Bootstrapped / tested on x86_64-unknown-linux-gnu.
>
> 2021-06-17  Richard Biener  <rguent...@suse.de>
>
> 	* config/i386/sse.md (vec_subaddv4df3): Duplicate from
> 	avx_addsubv4df3.
> 	* internal-fn.def (VEC_SUBADD): New internal optab fn.
> 	* optabs.def (vec_subadd_optab): New optab.
> 	* tree-vect-slp-patterns.c (class subadd_pattern): New.
> 	(slp_patterns): Add subadd_pattern.
> 	* tree-vectorizer.h (vect_pattern::vect_pattern): Make
> 	m_ops optional.
>
> 	* gcc.target/i386/vect-subadd-1.c: New testcase.
> ---
>  gcc/config/i386/sse.md                        | 16 +++
>  gcc/internal-fn.def                           |  1 +
>  gcc/optabs.def                                |  1 +
>  gcc/testsuite/gcc.target/i386/vect-subadd-1.c | 35 +++++++
>  gcc/tree-vect-slp-patterns.c                  | 98 +++++++++++++++++++
>  gcc/tree-vectorizer.h                         |  3 +-
>  6 files changed, 153 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/vect-subadd-1.c
>
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index 94296bc773b..73238c9874a 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -2396,6 +2396,22 @@
>      (set_attr "prefix" "<round_saeonly_scalar_prefix>")
>      (set_attr "mode" "<ssescalarmode>")])
>
> +/* ??? Rename avx_addsubv4df3, but its name is used throughout the backend,
> +   so avoid this churn for the moment.  */
> +(define_insn "vec_subaddv4df3"
> +  [(set (match_operand:V4DF 0 "register_operand" "=x")
> +	(vec_merge:V4DF
> +	  (minus:V4DF
> +	    (match_operand:V4DF 1 "register_operand" "x")
> +	    (match_operand:V4DF 2 "nonimmediate_operand" "xm"))
> +	  (plus:V4DF (match_dup 1) (match_dup 2))
> +	  (const_int 5)))]
> +  "TARGET_AVX"
> +  "vaddsubpd\t{%2, %1, %0|%0, %1, %2}"
> +  [(set_attr "type" "sseadd")
> +   (set_attr "prefix" "vex")
> +   (set_attr "mode" "V4DF")])
> +
>  (define_insn "avx_addsubv4df3"
>    [(set (match_operand:V4DF 0 "register_operand" "=x")
>  	(vec_merge:V4DF
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index b2f414d2131..a4578a0293e 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -281,6 +281,7 @@ DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT90, ECF_CONST, cadd90, binary)
>  DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT270, ECF_CONST, cadd270, binary)
>  DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL, ECF_CONST, cmul, binary)
>  DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL_CONJ, ECF_CONST, cmul_conj, binary)
> +DEF_INTERNAL_OPTAB_FN (VEC_SUBADD, ECF_CONST, vec_subadd, binary)
>
>
>  /* FP scales.  */
> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index b192a9d070b..cacbd02cfe1 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -407,6 +407,7 @@ OPTAB_D (vec_widen_usubl_hi_optab, "vec_widen_usubl_hi_$a")
>  OPTAB_D (vec_widen_usubl_lo_optab, "vec_widen_usubl_lo_$a")
>  OPTAB_D (vec_widen_uaddl_hi_optab, "vec_widen_uaddl_hi_$a")
>  OPTAB_D (vec_widen_uaddl_lo_optab, "vec_widen_uaddl_lo_$a")
> +OPTAB_D (vec_subadd_optab, "vec_subadd$a3")
>
>  OPTAB_D (sync_add_optab, "sync_add$I$a")
>  OPTAB_D (sync_and_optab, "sync_and$I$a")
> diff --git a/gcc/testsuite/gcc.target/i386/vect-subadd-1.c b/gcc/testsuite/gcc.target/i386/vect-subadd-1.c
> new file mode 100644
> index 00000000000..a79e5ba92e2
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/vect-subadd-1.c
> @@ -0,0 +1,35 @@
> +/* { dg-do run { target avx_runtime } } */
> +/* { dg-options "-O3 -mavx2 -fdump-tree-slp2" } */
> +
> +double x[4], y[4], z[4];
> +void __attribute__((noipa)) foo ()
> +{
> +  x[0] = y[0] - z[0];
> +  x[1] = y[1] + z[1];
> +  x[2] = y[2] - z[2];
> +  x[3] = y[3] + z[3];
> +}
> +void __attribute__((noipa)) bar ()
> +{
> +  x[0] = y[0] + z[0];
> +  x[1] = y[1] - z[1];
> +  x[2] = y[2] + z[2];
> +  x[3] = y[3] - z[3];
> +}
> +int main()
> +{
> +  for (int i = 0; i < 4; ++i)
> +    {
> +      y[i] = i + 1;
> +      z[i] = 2 * i + 1;
> +    }
> +  foo ();
> +  if (x[0] != 0 || x[1] != 5 || x[2] != -2 || x[3] != 11)
> +    __builtin_abort ();
> +  bar ();
> +  if (x[0] != 2 || x[1] != -1 || x[2] != 8 || x[3] != -3)
> +    __builtin_abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "SUBADD" 1 "slp2" } } */
> diff --git a/gcc/tree-vect-slp-patterns.c b/gcc/tree-vect-slp-patterns.c
> index 2ed49cd9edc..61732c53ad7 100644
> --- a/gcc/tree-vect-slp-patterns.c
> +++ b/gcc/tree-vect-slp-patterns.c
> @@ -1490,6 +1490,103 @@ complex_operations_pattern::build (vec_info * /* vinfo */)
>    gcc_unreachable ();
>  }
>
> +
> +/* The subadd_pattern.  */
> +
> +class subadd_pattern : public vect_pattern
> +{
> +  public:
> +    subadd_pattern (slp_tree *node)
> +      : vect_pattern (node, NULL, IFN_VEC_SUBADD) {};
> +
> +    void build (vec_info *);
> +
> +    static vect_pattern*
> +    recognize (slp_tree_to_load_perm_map_t *, slp_tree *);
> +};
> +
> +vect_pattern *
> +subadd_pattern::recognize (slp_tree_to_load_perm_map_t *, slp_tree *node_)
> +{
> +  slp_tree node = *node_;
> +  if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
> +      || SLP_TREE_CHILDREN (node).length () != 2)
> +    return NULL;
> +
> +  /* Match a blend of a plus and a minus op with the same number of plus and
> +     minus lanes on the same operands.  */
> +  slp_tree sub = SLP_TREE_CHILDREN (node)[0];
> +  slp_tree add = SLP_TREE_CHILDREN (node)[1];
> +  bool swapped_p = false;
> +  if (vect_match_expression_p (sub, PLUS_EXPR))
> +    {
> +      std::swap (add, sub);
> +      swapped_p = true;
> +    }
> +  if (!(vect_match_expression_p (add, PLUS_EXPR)
> +	&& vect_match_expression_p (sub, MINUS_EXPR)))
> +    return NULL;
> +  if (!((SLP_TREE_CHILDREN (sub)[0] == SLP_TREE_CHILDREN (add)[0]
> +	 && SLP_TREE_CHILDREN (sub)[1] == SLP_TREE_CHILDREN (add)[1])
> +	|| (SLP_TREE_CHILDREN (sub)[0] == SLP_TREE_CHILDREN (add)[1]
> +	    && SLP_TREE_CHILDREN (sub)[1] == SLP_TREE_CHILDREN (add)[0])))
> +    return NULL;
> +
> +  for (unsigned i = 0; i < SLP_TREE_LANE_PERMUTATION (node).length (); ++i)
> +    {
> +      std::pair<unsigned, unsigned> perm = SLP_TREE_LANE_PERMUTATION (node)[i];
> +      if (swapped_p)
> +	perm.first = perm.first == 0 ? 1 : 0;
> +      /* It has to be alternating -, +, -, ...
> +	 While we could permute the .SUBADD inputs and the .SUBADD output,
> +	 that's only profitable over the add + sub + blend if at least
> +	 one of the permutes is optimized, which we can't determine here.  */
> +      if (perm.first != (i & 1)
> +	  || perm.second != i)
> +	return NULL;
> +    }
> +
> +  if (!vect_pattern_validate_optab (IFN_VEC_SUBADD, node))
> +    return NULL;
> +
> +  return new subadd_pattern (node_);
> +}
> +
> +void
> +subadd_pattern::build (vec_info *vinfo)
> +{
> +  slp_tree node = *m_node;
> +
> +  slp_tree sub = SLP_TREE_CHILDREN (node)[0];
> +  slp_tree add = SLP_TREE_CHILDREN (node)[1];
> +  if (vect_match_expression_p (sub, PLUS_EXPR))
> +    std::swap (add, sub);
> +
> +  /* Modify the blend node in-place.  */
> +  SLP_TREE_CHILDREN (node)[0] = SLP_TREE_CHILDREN (sub)[0];
> +  SLP_TREE_CHILDREN (node)[1] = SLP_TREE_CHILDREN (sub)[1];
> +  SLP_TREE_REF_COUNT (SLP_TREE_CHILDREN (node)[0])++;
> +  SLP_TREE_REF_COUNT (SLP_TREE_CHILDREN (node)[1])++;
> +
> +  /* Build IFN_VEC_SUBADD from the sub representative operands.  */
> +  stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (sub);
> +  gcall *call = gimple_build_call_internal (IFN_VEC_SUBADD, 2,
> +					    gimple_assign_rhs1 (rep->stmt),
> +					    gimple_assign_rhs2 (rep->stmt));
> +  gimple_call_set_lhs (call, make_ssa_name
> +		       (TREE_TYPE (gimple_assign_lhs (rep->stmt))));
> +  gimple_call_set_nothrow (call, true);
> +  gimple_set_bb (call, gimple_bb (rep->stmt));
> +  SLP_TREE_REPRESENTATIVE (node) = vinfo->add_pattern_stmt (call, rep);
> +  STMT_VINFO_RELEVANT (SLP_TREE_REPRESENTATIVE (node)) = vect_used_in_scope;
> +  STMT_SLP_TYPE (SLP_TREE_REPRESENTATIVE (node)) = pure_slp;
> +  SLP_TREE_CODE (node) = ERROR_MARK;
> +  SLP_TREE_LANE_PERMUTATION (node).release ();
> +
> +  vect_free_slp_tree (sub);
> +  vect_free_slp_tree (add);
> +}
> +
> 
> /*******************************************************************************
>  * Pattern matching definitions
>  *******************************************************************************/
> @@ -1502,6 +1599,7 @@ vect_pattern_decl_t slp_patterns[]
>     overlap in what they can detect.  */
>
>    SLP_PATTERN (complex_operations_pattern),
> +  SLP_PATTERN (subadd_pattern)
>  };
>  #undef SLP_PATTERN
>
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index 04c20f8bd0f..b9824623ad9 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -2100,7 +2100,8 @@ class vect_pattern
>      this->m_ifn = ifn;
>      this->m_node = node;
>      this->m_ops.create (0);
> -    this->m_ops.safe_splice (*m_ops);
> +    if (m_ops)
> +      this->m_ops.safe_splice (*m_ops);
>    }
>
>   public:
> --
> 2.26.2